Abstract - In this work the information loss in deterministic,
memoryless systems is investigated by evaluating the conditional 
entropy of the input random variable given the output random 
variable. It is shown that for a large class of systems the infor- 
mation loss is finite, even if the input is continuously distributed. 
Based on this finiteness, the problem of perfectly reconstructing 
the input is addressed and Fano-type bounds between the 
information loss and the reconstruction error probability are 
derived. 

For systems with infinite information loss a relative measure
is defined and shown to be tightly related to Rényi information
dimension. Employing another Fano-type argument, the reconstruction
error probability is bounded by the relative information
loss from below.

In view of developing a system theory from an information- 
theoretic point-of-view, the theoretical results are illustrated by 
a few example systems, among them a multi-channel autocorre- 
lation receiver. 

Index Terms - Data processing inequality, Fano's inequality,
information loss, Rényi information dimension, system theory

I. Introduction 

When opening a textbook on linear [1] or nonlinear [2]
input-output systems, the characterizations one typically finds
- aside from the difference or differential equation defining
the system - are almost exclusively energy-centered in nature:
transfer functions, input-output stability, passivity, losslessness,
and the $\mathcal{L}_2$ or energy/power gain are all defined using the
amplitudes (or amplitude functions) of the involved signals,
and are therefore essentially energetic in nature. When opening a
textbook on statistical signal processing [3] or an engineering-oriented
textbook on stochastic processes [4], one can add
correlations, power spectral densities, and how they are affected
by linear and nonlinear systems (e.g., the Bussgang
theorem [4, Thm. 9-17]). Given this overwhelming prevalence of
energetic measures and second-order statistics, it is no surprise 
that many problems in system theory or signal processing are 
formulated in terms of energetic cost functions, e.g., the mean- 
squared error. 

What one does not find in all these books is an information- 
theoretic characterization of the system at hand, despite the 
fact that such a characterization is strongly suggested by an 
elementary theorem: the data processing inequality. We know 
that the information content of a signal (be it a random vari- 
able or a stochastic process) cannot increase by deterministic 
processing, just as, loosely speaking, a passive system cannot
increase the energy contained in a signal.^1

Clearly, the information lost in a system not only depends 
on the system but also on the signal carrying this information; 
the same holds for the energy lost or gained. While there 
is a strong connection between energy and information for 
Gaussian signals (entropy and entropy rate of a Gaussian 
signal are related to its variance and power spectral density, 
respectively), for non-Gaussian signals this connection degen- 
erates to a bound by the max-entropy property of the Gaussian 
distribution. Energy and information of a signal therefore can 
behave completely differently when fed through a system. But 
while we have a definition of the energy loss (namely, the
inverse $\mathcal{L}_2$ gain), an analysis of the information loss in a
system is still lacking.

It is the purpose of this work to close this gap and to 
propose the information loss - the conditional entropy of the 
input given the output - as a general system characteristic, 
complementing the prevailing energy-centered descriptions. 
The choice of this conditional entropy is partly motivated by
the data processing inequality (cf. Definition 1) and justified
by a recent axiomatization of information loss [6].

At present we restrict ourselves to memoryless systems^2
operating on (multidimensional) random variables. The reason 
for this is that already for this comparably simple system class 
a multitude of questions can be asked (e.g., what happens if we 
lose an infinite amount of information?), to some of which we 
intend to present adequate answers (e.g., by the introduction 
of a relative measure of information loss). This manuscript can 
thus be regarded as a first small step towards a system theory 
from an information-theoretic point-of-view; we hope that the 
results presented here will lay the foundation for some of the
forthcoming steps, such as extensions to stochastic processes 
or systems with memory. 

A. Related Work 

To the best of the authors' knowledge, very few results 
about the information processing behavior of deterministic 
input-output systems have been published. Notable exceptions 
are Pippenger's analysis of the information lost in the multiplication
of two integer random variables [7] and the work
of Watanabe and Abraham concerning the rate of informa- 
tion loss caused by feeding a discrete-time, finite-alphabet 
stationary stochastic process through a static, non-injective 

1 At least by not more than a finite amount, cf. [5].

2 Since these systems will be described by (typically) nonlinear functions, 
we will use "systems" and "functions" interchangeably. 
function [8]. Moreover, in [9] the authors made an effort to
extend the results of Watanabe and Abraham to dynamical 
input-output systems with finite internal memory. All these 
works, however, focus only on discrete random variables and 
stochastic processes. 

Slightly larger, but still focused on discrete random vari- 
ables only, is the field concerning information-theoretic cost 
functions: The infomax principle [10], the information bottleneck
method [11] using the Kullback-Leibler divergence as
a distortion function, and system design by minimizing the
error entropy (e.g., [12]) are just a few examples of this recent
trend. Additionally, Lev's approach to aggregating accounting
data [13], and, although not immediately evident, the work
about macroscopic descriptions of multi-agent systems [14]
belong to that category.

A system theory for neural information processing has been
proposed by Johnson^3 in [15]. The assumptions made there
(information need not be stochastic, the same information can
be represented by different signals, information can be seen
as a parameter of a probability distribution, etc.) suggest the
use of the Kullback-Leibler divergence as a central quantity.
Although these assumptions are incompatible with ours, some
similarities exist (e.g., the information transfer ratio of a
cascade in [15] and Proposition 4).

Aside from these, the closest the literature comes to these 
concepts is in the field of autonomous dynamical systems 
or iterated maps: There, a multitude of information-theoretic 
characterizations are used to measure the information transfer 
in deterministic systems. In particular, the information flow 
between small and large scales, caused by the folding and 
stretching behavior of chaotic one-dimensional maps was 
analyzed in [16]: the result they present is remarkably similar
to our Proposition 7. Another notable example is the
introduction of transfer entropy in [17], [18] to capture the
information transfer between states of different (sub-)systems.
An alternative measure for the information exchanged between
system components was introduced in [19]. Information transfer
between time series and spatially distinct points in the
phase space of a system is discussed, e.g., in [20], [21].
Notably, these works can be assumed to follow the spirit of
Kolmogorov and Sinai, who characterized dynamical systems
exhibiting chaotic behavior with entropy [22]-[24], cf. [25].

Recently, independent from the present authors, Baez et
al. [6] took the reverse approach and formulated axioms for
the information loss induced by a measure-preserving function
between finite sets, such as continuity and functoriality
(cf. Proposition 1 in this work). They show that the difference
between the entropy of the input and the entropy of the output, 
or, equivalently, the conditional entropy of the input given the 
output is the only function satisfying the axioms, thus adding 
justification to one of our definitions. 



3 Interestingly, Johnson gave a further motivation for the present work by 
claiming that "Classic information theory is silent on how to use information 
theoretic measures (or if they can be used) to assess actual system perfor- 
mance". 



B. Outline and Contributions 

The organization of the paper is as follows: In Section II we
give the definitions of absolute and relative information loss
and analyze their elementary properties. Among other things,
we prove a connection between relative information loss and
Rényi information dimension and the additivity of absolute
information loss for a cascade. In Section III we turn to a class of
systems which has finite absolute information loss for a real-valued
input. A connection to differential entropy
is shown as well as numerous upper bounds on the information
loss. Given this finite information loss, we present Fano-type
inequalities for the probability of a reconstruction error.
Section IV deals with systems exhibiting infinite absolute
information loss, e.g., quantizers and systems reducing the
dimensionality of the data, to which we apply our notion
of relative information loss. Presenting a similar connection
between relative information loss and the reconstruction error
probability establishes a link to analog compression as
investigated in [26]. We apply our theoretical results to two
larger systems in Section V: a communications receiver and
an accumulator. It is shown that indeed both the absolute and
relative measures of information loss are necessary to fully
characterize even the restricted class of systems we analyze in
this work. Finally, Section VI points out open
issues and lays out a roadmap for further research.

II. Definition and Elementary Properties of 
Information Loss and Relative Information Loss 

A. Notation 

We adopt the following notation: Random variables (RVs) 
are represented by upper case letters (e.g., X), lower case 
letters (e.g., x) are reserved for (deterministic) constants or 
realizations of RVs. The alphabet of an RV is indicated by a
calligraphic letter (e.g., $\mathcal{X}$). The probability distribution of an
RV $X$ is denoted by $P_X$. Similarly, $P_{XY}$ and $P_{X|Y}$ denote
the joint distribution of the RVs $X$ and $Y$ and the conditional
distribution of $X$ given $Y$, respectively.

If $\mathcal{X}$ is a proper subset of the $N$-dimensional Euclidean
space $\mathbb{R}^N$ and if $P_X$ is absolutely continuous w.r.t. the $N$-dimensional
Lebesgue measure $\mu^N$ (in short, $P_X \ll \mu^N$), then
$P_X$ possesses a probability density function (PDF) w.r.t. the
Lebesgue measure, which we will denote as $f_X$. Conversely,
if the probability measure $P_X$ is concentrated on an at most
countable set of points, we will write $p_X$ for its probability
mass function, omitting the index whenever it is clear from
the context.

We deal with functions of (real-valued) RVs: If for example
$(\mathcal{X}, \mathfrak{B}_\mathcal{X}, P_X)$ is the (standard) probability space induced by
the RV $X$ and $(\mathcal{Y}, \mathfrak{B}_\mathcal{Y})$ is another (standard) measurable space,
and $g\colon \mathcal{X} \to \mathcal{Y}$ is measurable, we can define a new RV as
$Y = g(X)$. The probability distribution of $Y$ is

$$\forall B \in \mathfrak{B}_\mathcal{Y}: \quad P_Y(B) = P_X(g^{-1}[B]) \tag{1}$$

where $g^{-1}[B] = \{x \in \mathcal{X} : g(x) \in B\}$ denotes the preimage
of $B$ under $g$. Abusing notation, we write $P_Y(y)$ for the
probability measure of a single point instead of $P_Y(\{y\})$.
In particular, if $g$ is a quantizer which induces a partition
$\mathcal{P}$ of $\mathcal{X}$, we write $\hat{X}$ for the quantized RV. The uniform
quantization of $X$ with hypercubes of side length $1/2^n$ is
denoted by

$$\hat{X}_n = \frac{\lfloor 2^n X \rfloor}{2^n} \tag{2}$$

where the floor operation is applied element-wise if $X$ is a
multi-dimensional RV. The partition induced by this uniform
quantizer will be denoted as^4 $\mathcal{P}_n = \{\hat{\mathcal{X}}_k^{(n)}\}$. Note also that
the partition $\mathcal{P}_n$ gets refined with increasing $n$ (in short, $\mathcal{P}_n \succ \mathcal{P}_{n+1}$).

Finally, $H(\cdot)$, $h(\cdot)$, $H_2(\cdot)$, and $I(\cdot;\cdot)$ denote the entropy,
the differential entropy, the binary entropy function, and the 
mutual information, respectively. Unless noted otherwise, the 
logarithm is taken to base two, so all entropies are measured 
in bits. 

B. Information Loss 

A measure of information loss in a deterministic input- 
output system should, roughly speaking, quantify the differ- 
ence between the information available at its input and its 
output. While for discrete RVs this amounts to the difference 
of their entropies, continuous RVs require more attention. 
To this end, in Fig. 1 we propose a model to compute the
information loss of a system which applies to all real-valued 
RVs. 

In particular, we quantize the system input with partition
$\mathcal{P}_n$ and compute the mutual information between the input $X$
and its quantization $\hat{X}_n$, as well as the mutual information
between the system output $Y$ and $\hat{X}_n$. The first quantity is an
approximation of the information available at the input, while
the second approximates the information shared between input
and output, i.e., the information passing through the system
and thus being available at its output. By the data processing
inequality (cf. [27, Cor. 7.16]), the former cannot be smaller
than the latter, i.e., $I(\hat{X}_n; X) \geq I(\hat{X}_n; Y)$, with equality if the
system is described by a bijective function. We compute the
difference between these two mutual informations to obtain 
an approximation of the information lost in the system. For 
bijective functions, for which the two quantities are equal, 
the information loss will vanish, as suggested by intuition: 
Bijective functions describe lossless systems. 

Refining the partition yields better approximations; we thus 
present 

Definition 1 (Information Loss). Let $X$ be an RV with
alphabet $\mathcal{X}$, and let $Y = g(X)$. The information loss induced
by $g$ is

$$L(X \to Y) = \lim_{n\to\infty} \left( I(\hat{X}_n; X) - I(\hat{X}_n; Y) \right) = H(X|Y). \tag{3}$$

Along the lines of [27, Lem. 7.20] one obtains

\begin{align}
I(\hat{X}_n; X) - I(\hat{X}_n; Y) &= H(\hat{X}_n) - H(\hat{X}_n|X) - H(\hat{X}_n) + H(\hat{X}_n|Y) \tag{4} \\
&= H(\hat{X}_n|Y) \tag{5}
\end{align}



4 E.g., for a one-dimensional RV, the $k$-th element of $\mathcal{P}_n$ is X,
Fig. 1. Model for computing the information loss of a memoryless input-output
system $g$. $Q$ is a quantizer with partition $\mathcal{P}_n$.

Fig. 2. Cascade of systems: The information loss of the cascade equals the
sum of the individual information losses of the constituent systems.



since $\hat{X}_n$ is a function of $X$. With the monotone convergence
$H(\hat{X}_n|Y = y) \nearrow H(X|Y = y)$ implied by [27, Lem. 7.18]
it follows that $L(X \to Y) = H(X|Y)$. Indeed, for a discrete
input RV $X$ we obtain $L(X \to Y) = H(X) - H(Y)$.
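
For a discrete input, the identity $L(X \to Y) = H(X) - H(Y)$ can be checked numerically. The following is a minimal Python sketch and not part of the original development: the probability mass function `pmf_x`, the helper names, and the choice of a square-law map on a five-point alphabet are illustrative assumptions.

```python
from collections import defaultdict
from math import log2


def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def push_forward(pmf_x, g):
    """Distribution of Y = g(X) for a discrete X."""
    pmf_y = defaultdict(float)
    for x, p in pmf_x.items():
        pmf_y[g(x)] += p
    return dict(pmf_y)


def information_loss(pmf_x, g):
    """H(X|Y) in bits, computed from p(x|y) = p(x) / p_Y(g(x))."""
    pmf_y = push_forward(pmf_x, g)
    return -sum(p * log2(p / pmf_y[g(x)]) for x, p in pmf_x.items() if p > 0)


# Hypothetical input distribution and a non-injective system (square-law map).
pmf_x = {-2: 0.1, -1: 0.2, 0: 0.3, 1: 0.25, 2: 0.15}
square = lambda x: x * x

loss = information_loss(pmf_x, square)
difference = entropy(pmf_x) - entropy(push_forward(pmf_x, square))
identity_loss = information_loss(pmf_x, lambda x: x)  # bijective system
```

Under these assumptions `loss` and `difference` agree up to rounding, and the bijective map yields zero loss, in line with the remark that bijective functions describe lossless systems.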

While $L(X \to Y)$ and $H(X|Y)$ can be used interchangeably,
we will stick to the notation $L(X \to Y)$ to make clear
that $Y$ is a function of $X$.

For discrete input RVs X or stochastic systems (e.g., 
communication channels) the mutual information between the 
input and the output I(X;Y), i.e., the information transfer, 
is an appropriate characterization. In contrast, deterministic 
systems with non-discrete input RVs usually exhibit infinite 
information transfer $I(X; Y)$. As we will show in Section III,
there exists a large class of systems for which the information
loss $L(X \to Y)$ remains finite, which allows a
meaningful description of the system.

One elementary property of information loss, which 
will prove useful in developing a system theory from an 
information-theoretic point-of-view, is found in the cascade 
of systems (see Fig. 2). We maintain^5

Proposition 1 (Information Loss of a Cascade). Consider
two functions $g\colon \mathcal{X} \to \mathcal{Y}$ and $h\colon \mathcal{Y} \to \mathcal{Z}$ and a cascade
of systems implementing these functions. Let $Y = g(X)$ and
$Z = h(Y)$. The information loss induced by this cascade,
or equivalently, by the system implementing the composition
$(h \circ g)(\cdot) = h(g(\cdot))$ is given by:

$$L(X \to Z) = L(X \to Y) + L(Y \to Z) \tag{6}$$

Proof: Referring to Definition 1 and [28, Ch. 3.9] we



5 In [6] this property was formulated as an axiom desirable for a measure
of information loss.



obtain

\begin{align}
L(X \to Z) &= H(X|Z) \tag{7} \\
&= H(Y|Z) + H(X|Y,Z) \tag{8} \\
&= H(Y|Z) + H(X|Y) \tag{9}
\end{align}

since $Y$ and $(Y, Z)$ are mutually subordinate. ■
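
The additivity in (6) can be illustrated for discrete RVs with a short Python sketch. The two systems below (a magnitude device followed by a clipper), the uniform input on seven integers, and all helper names are assumptions chosen for illustration only.

```python
from collections import defaultdict
from math import log2


def push_forward(pmf_x, f):
    """Distribution of f(X) for a discrete X given as {outcome: probability}."""
    pmf_out = defaultdict(float)
    for x, p in pmf_x.items():
        pmf_out[f(x)] += p
    return dict(pmf_out)


def conditional_entropy(pmf_x, f):
    """H(X | f(X)) in bits."""
    pmf_out = push_forward(pmf_x, f)
    return -sum(p * log2(p / pmf_out[f(x)]) for x, p in pmf_x.items() if p > 0)


g = lambda x: abs(x)      # first system: magnitude
h = lambda y: min(y, 2)   # second system: clipper

pmf_x = {x: 1 / 7 for x in range(-3, 4)}       # hypothetical uniform input

loss_xy = conditional_entropy(pmf_x, g)                    # L(X -> Y)
loss_yz = conditional_entropy(push_forward(pmf_x, g), h)   # L(Y -> Z)
loss_xz = conditional_entropy(pmf_x, lambda x: h(g(x)))    # L(X -> Z)
```

For this particular choice the individual losses are 6/7 and 4/7 bits, and the loss of the cascade is their sum, 10/7 bits.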
A sufficient condition for the information loss to be infinite 
is presented in 

Proposition 2 (Infinite Information Loss). Let $g\colon \mathcal{X} \to \mathcal{Y}$,
$\mathcal{X} \subseteq \mathbb{R}^N$, and let the input RV $X$ be such that its probability
measure $P_X$ has an absolutely continuous component
$P_X^{ac} \ll \mu^N$ which is supported on $\mathcal{X}$. If there exists a set
$B \subseteq \mathcal{Y}$ of positive $P_Y$-measure such that the preimage $g^{-1}[y]$
is uncountable for every $y \in B$, then

$$L(X \to Y) = \infty. \tag{10}$$

Proof: We write the information loss $L(X \to Y)$ as

\begin{align}
L(X \to Y) = H(X|Y) &= \int_{\mathcal{Y}} H(X|Y = y)\, dP_Y(y) \tag{11} \\
&\geq \int_{B} H(X|Y = y)\, dP_Y(y) \tag{12}
\end{align}

since $B \subseteq \mathcal{Y}$. By assumption, the conditional probability measure
$P_{X|Y=y}$ is not concentrated on a countable set of points
(the preimage of $y$ under $g$ is uncountable, and the probability
measure $P_X$ has an absolutely continuous component on all of
$\mathcal{X}$), so one obtains $H(X|Y = y) = \infty$ for all $y \in B$. The proof
follows from $P_Y(B) > 0$. ■
It will be useful to explicitly state the following 



Corollary 1. Let $P_X \ll \mu^N$. If there exists a point $y^* \in \mathcal{Y}$
such that $P_Y(y^*) > 0$, then

$$L(X \to Y) = \infty. \tag{13}$$

Proof: Since $y^*$ has positive $P_Y$-measure and since
$P_X \ll \mu^N$, the preimage of $y^*$ under $g$ needs to be uncountable. ■
In other words, if the input is continuously distributed and 
if the distribution of the output has a non-vanishing discrete 
component, the information loss is infinite. A particularly 
simple example for such a case is a quantizer: 

Example 1 (Quantizer). We now look at the information loss
of a scalar quantizer, i.e., of a system described by a function

$$g(x) = \lfloor x \rfloor. \tag{14}$$

With the notation introduced above we obtain $Y = \hat{X}_0 = g(X)$.
Assuming that $X$ has an absolutely continuous distribution
($P_X \ll \mu$), there will be at least one point $y^*$ for
which $\Pr(Y = y^*) = P_Y(y^*) = P_X([y^*, y^* + 1)) > 0$. The
conditions of Corollary 1 are thus fulfilled and we obtain

$$L(X \to \hat{X}_0) = \infty. \tag{15}$$


This simple example illustrates the information loss as the
difference between the information available at the input and
the output of a system: While in all practically relevant cases
a quantizer will have finite information at its output
($H(Y) < \infty$), the information at the input is infinite as soon
as $P_X$ has a continuous component.

Fig. 3. The center clipper - an example for a system with both infinite
information loss and information transfer.

While for a quantizer the mutual information between the 
input and the output may be a more appropriate characteri- 
zation because it remains finite, the following example shows 
that also mutual information has its limitations: 

Example 2 (Center Clipper). The center clipper, used for,
e.g., residual echo suppression [29], can be described by the
following function (see Fig. 3):

$$g(x) = \begin{cases} x, & \text{if } |x| > c \\ 0, & \text{otherwise.} \end{cases} \tag{16}$$

Assuming again that $P_X \ll \mu$ and that $0 < P_X([-c, c]) < 1$,
with Corollary 1 the information loss becomes infinite. On the
other hand, there exists a subset of $\mathcal{X} \times \mathcal{Y}$ (namely, a subset of
the line $x = y$) with positive $P_{XY}$ measure, for which $P_X P_Y$
vanishes. Thus, with [28, Thm. 2.1.2]

$$I(X;Y) = \infty. \tag{17}$$

C. Relative Information Loss 

As the previous example shows, there are systems for 
which neither information transfer (i.e., the mutual information 
between input and output) nor information loss provides suf- 
ficient insight. For these systems, a different characterization 
is necessary, which leads to 

Definition 2 (Relative Information Loss). The relative information
loss induced by $Y = g(X)$ is defined as

$$l(X \to Y) = \lim_{n\to\infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{18}$$

provided the limit exists.
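
The ratio in (18) can be evaluated exactly for a simple case. The sketch below is an illustration under stated assumptions rather than a general implementation: the input is taken to be uniform on $[0, 4)$, the system is the quantizer $g(x) = \lfloor x \rfloor$ of Example 1, and the function name `quantizer_losses` is hypothetical. Because the cells of $\mathcal{P}_n$ align with the unit intervals, the output is a function of the cell index.

```python
from collections import defaultdict
from math import log2


def quantizer_losses(n, support=4):
    """Return (H(X_n|Y), H(X_n|Y)/H(X_n)) in bits for X uniform on [0, support)
    and Y = floor(X), where X_n is the uniform quantization with cell width 2^-n."""
    cells = support * 2 ** n          # number of equiprobable cells
    p_cell = 1.0 / cells
    pmf_y = defaultdict(float)
    for k in range(cells):
        pmf_y[k // 2 ** n] += p_cell  # cell k lies in the unit interval k // 2^n
    h_xn = log2(cells)
    h_xn_given_y = -sum(
        p_cell * log2(p_cell / pmf_y[k // 2 ** n]) for k in range(cells)
    )
    return h_xn_given_y, h_xn_given_y / h_xn


results = {n: quantizer_losses(n) for n in (2, 6, 10)}
```

In this setting $H(\hat{X}_n|Y) = n$ bits grows without bound, consistent with the infinite absolute loss of Example 1, while the ratio equals $n/(n+2)$ and approaches one.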



One elementary property of the relative information loss is
that $l(X \to Y) \in [0, 1]$, due to the non-negativity of entropy
and the fact that $H(\hat{X}_n|Y) \leq H(\hat{X}_n)$. The relative information
loss is related to the Rényi information dimension, a connection which
we will establish after presenting
Definition 3 (Rényi Information Dimension [30]). The information
dimension of an RV $X$ is

$$d(X) = \lim_{n\to\infty} \frac{H(\hat{X}_n)}{n} \tag{19}$$

provided the limit exists and is finite.
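
To make the definition concrete, the following Python sketch evaluates $H(\hat{X}_n)/n$ in closed form for a discrete-continuous mixture. The mixture (an atom at zero with probability $1 - \rho$, uniform on $[0,1)$ with probability $\rho$) and the function name are assumptions made for illustration; for such mixtures the information dimension is known to equal the weight $\rho$ of the continuous part.

```python
from math import log2


def quantized_entropy_mixture(n, rho=0.5):
    """H(X_n) in bits for X = 0 w.p. 1 - rho and X uniform on [0, 1) w.p. rho."""
    cell = rho * 2.0 ** (-n)            # continuous mass in each of the 2^n cells
    first = (1.0 - rho) + cell          # the cell containing the atom at zero
    others = (2.0 ** n - 1.0) * cell    # total mass of the remaining cells
    return -first * log2(first) - others * log2(cell)


# The ratio H(X_n)/n approaches rho = 0.5 only slowly, roughly like 0.5 + 1/n.
ratios = {n: quantized_entropy_mixture(n) / n for n in (10, 100, 400)}
uniform_ratio = quantized_entropy_mixture(50, rho=1.0) / 50  # purely continuous
```

The purely continuous case (`rho=1.0`) gives a ratio of one for every $n$, matching $d(X) = N = 1$ for an absolutely continuous scalar RV.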



We adopted this definition from Wu and Verdú, who showed
in [26, Prop. 2] that it is equivalent to the one given by
Rényi in [30]. Note further that we excluded the case that
the information dimension is infinite, which may occur if
$H(\hat{X}_0) = \infty$ [26, Prop. 1]. Conversely, if the information
dimension of an RV $X$ exists, it is guaranteed to be finite if
$H(\hat{X}_0) < \infty$ [30] or if $E\{|X|^\epsilon\} < \infty$ for some $\epsilon > 0$ [26].
Aside from that, the information dimension exists for discrete
RVs and RVs with probability measures absolutely continuous
w.r.t. the Lebesgue measure on a sufficiently smooth
manifold [30], for mixtures of RVs with existing information
dimension [26], [30], [31], and self-similar distributions generated
by iterated function systems [26]. Finally, the information
dimension exists if the MMSE dimension exists [32, Thm. 8].
For the remainder of this work we will assume that the
information dimension of all considered RVs exists and is
finite.

We are now ready to state 

Proposition 3 (Relative Information Loss and Information
Dimension). Let $X$ be an $N$-dimensional RV with positive
information dimension $d(X)$. If $d(X|Y = y)$ exists and is
finite $P_Y$-a.s., the relative information loss equals

$$l(X \to Y) = \frac{d(X|Y)}{d(X)} \tag{20}$$

where $d(X|Y) = \int_{\mathcal{Y}} d(X|Y = y)\, dP_Y(y)$.
Proof: From Definition 2 we obtain

\begin{align}
l(X \to Y) &= \lim_{n\to\infty} \frac{\int_{\mathcal{Y}} H(\hat{X}_n|Y = y)\, dP_Y(y)}{H(\hat{X}_n)} \tag{21} \\
&= \lim_{n\to\infty} \frac{\int_{\mathcal{Y}} \frac{H(\hat{X}_n|Y = y)}{n}\, dP_Y(y)}{\frac{H(\hat{X}_n)}{n}}. \tag{22}
\end{align}

By assumption, the limit of the denominator and the limit of the expression
under the integral both exist and correspond to $d(X)$ and
$d(X|Y = y)$, respectively. Since for an $\mathbb{R}^N$-valued RV $X$
the information dimension satisfies [26], [30]

$$d(X|Y = y) \leq N \tag{23}$$

one can apply Lebesgue's dominated convergence theorem
(e.g., [33]) to exchange the order of the limit and the integral.
The limit of the numerator thus exists and we continue with

$$l(X \to Y) = \frac{\lim_{n\to\infty} \int_{\mathcal{Y}} \frac{H(\hat{X}_n|Y = y)}{n}\, dP_Y(y)}{d(X)} = \frac{\int_{\mathcal{Y}} d(X|Y = y)\, dP_Y(y)}{d(X)} = \frac{d(X|Y)}{d(X)}. \tag{24}$$

This completes the proof. ■
We accompany the relative information loss by its logical 
complement, the relative information transfer: 



Definition 4 (Relative Information Transfer). The relative
information transfer achieved by $Y = g(X)$ is

$$t(X \to Y) = \lim_{n\to\infty} \frac{I(\hat{X}_n; Y)}{H(\hat{X}_n)} = 1 - l(X \to Y) \tag{25}$$

provided the limit exists.



While, as suggested by the data processing inequality, the 
focus of this work is on information loss, we introduce this 
definition to simplify a few proofs of the forthcoming results. 
In particular, we maintain 

Proposition 4 (Relative Information Transfer and Information
Dimension). Let $X$ be an RV with positive information dimension
$d(X)$ and let $g$ be a Lipschitz function. Then, the relative
information transfer through this function is

$$t(X \to Y) = \frac{d(Y)}{d(X)}. \tag{26}$$

Proof: See Appendix A. ■
This result for Lipschitz functions is the basis for several of 
the elementary properties of relative information loss presented 
in the remainder of this section. Aside from that, it suggests 
that, at least for this restricted class of functions, 



$$d(X) = d(X, Y) = d(X|Y) + d(Y) \tag{27}$$

holds. This complements the results of [34], where it was
shown that the point-wise information dimension satisfies this 
chain rule (second equality) given that the conditional prob- 
ability measure satisfies a Lipschitz property. Moreover, the 
first equality was shown to hold for the point-wise information 
dimension given that Y is a Lipschitz function of X; for 
non-Lipschitz functions the point-wise information dimension 
of (X, Y) may exceed the one of X. Whether the same holds for
the information dimension (which is the expectation over the
point-wise information dimension) is an interesting question
for future research.

We can now present the counterpart of Proposition 1 for
relative information loss: 

Proposition 5 (Relative Information Loss of a Cascade).
Consider two Lipschitz functions $g\colon \mathcal{X} \to \mathcal{Y}$ and $h\colon \mathcal{Y} \to \mathcal{Z}$
and a cascade of systems implementing these functions. Let
$Y = g(X)$ and $Z = h(Y)$. For the cascade of these systems
the relative information transfer and relative information loss
are given as

$$t(X \to Z) = t(X \to Y)\, t(Y \to Z) \tag{28}$$

and

$$l(X \to Z) = l(X \to Y) + l(Y \to Z) - l(X \to Y)\, l(Y \to Z) \tag{29}$$

respectively.

Proof: The proof follows from Proposition 4 and Definition 4. ■
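
As a small worked instance of (28) and (29), consider a cascade of two coordinate projections. The Python sketch below is illustrative only: it assumes an input with an absolutely continuous distribution on $\mathbb{R}^4$ (so $d(X) = 4$), a first system keeping three coordinates, and a second system keeping one of them, so that the information dimensions of the outputs are 3 and 1. Exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def cascade_transfer(t_xy, t_yz):
    """Relative information transfer of a cascade, cf. (28)."""
    return t_xy * t_yz


def cascade_loss(l_xy, l_yz):
    """Relative information loss of a cascade, cf. (29)."""
    return l_xy + l_yz - l_xy * l_yz


# Assumed information dimensions of input, intermediate signal, and output.
d_x, d_y, d_z = 4, 3, 1

t_xy = Fraction(d_y, d_x)   # transfer of the first projection, d(Y)/d(X)
t_yz = Fraction(d_z, d_y)   # transfer of the second projection, d(Z)/d(Y)

t_xz = cascade_transfer(t_xy, t_yz)
l_xz = cascade_loss(1 - t_xy, 1 - t_yz)
```

Both routes give the same answer: the transfer of the cascade is 1/4 and its relative loss is 3/4, i.e., the two expressions are consistent with $t = 1 - l$.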
D. Interplay between Information Loss and Relative Information
Loss

We introduced the relative information loss to characterize
systems for which the absolute information loss from Definition 1
is infinite. The following result shows that, at least
for input RVs with infinite entropy, an infinite absolute information
loss is a prerequisite for positive relative information
loss:

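Proposition 6 (Positive Relative Loss leads to Infinite Absolute
Loss). Let $X$ be such that $H(X) = \infty$ and let
$l(X \to Y) > 0$. Then, $L(X \to Y) = \infty$.

Proof: We prove the proposition by contradiction. To this
end, assume that $L(X \to Y) = H(X|Y) = \kappa < \infty$. Thus,

\begin{align}
l(X \to Y) &= \lim_{n\to\infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{30} \\
&\overset{(a)}{\leq} \lim_{n\to\infty} \frac{H(X|Y)}{H(\hat{X}_n)} \tag{31} \\
&\overset{(b)}{=} 0 \tag{32}
\end{align}

where (a) is due to data processing and (b) follows from
$H(X|Y) = \kappa < \infty$ and from $H(\hat{X}_n) \to H(X) = \infty$
(e.g., [27, Lem. 7.18]). This contradicts $l(X \to Y) > 0$. ■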
Note that the converse is not true: There exist examples
where an infinite amount of information is lost, but for which
the relative information loss nevertheless vanishes, i.e., $l(X \to Y) = 0$
(see Example 4 further below).

III. Information Loss for Piecewise Bijective 
Functions 

In this section we analyze the information loss for a
restricted class of functions and under the practically relevant
assumption that the input RV has a probability distribution
$P_X \ll \mu^N$ supported on $\mathcal{X}$. Let $\{\mathcal{X}_i\}$ be a partition of
$\mathcal{X} \subseteq \mathbb{R}^N$, i.e., the elements $\mathcal{X}_i$ are disjoint and unite to $\mathcal{X}$,
and let $P_X(\mathcal{X}_i) > 0$ for all $i$. We present

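Definition 5 (Piecewise Bijective Function). A piecewise
bijective function $g\colon \mathcal{X} \to \mathcal{Y}$, $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}^N$, is a surjective
function defined in a piecewise manner:

$$g(x) = \begin{cases} g_1(x), & \text{if } x \in \mathcal{X}_1 \\ g_2(x), & \text{if } x \in \mathcal{X}_2 \\ \;\vdots \end{cases} \tag{33}$$

where each $g_i\colon \mathcal{X}_i \to \mathcal{Y}_i$ is bijective. Furthermore, the Jacobian
matrix $\mathcal{J}_g(\cdot)$ exists on the closures of $\mathcal{X}_i$, and its determinant,
$\det \mathcal{J}_g(\cdot)$, is non-zero $P_X$-a.s.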

A direct consequence of this definition is that also $P_Y \ll \mu^N$.
Thus, $P_Y$ possesses a PDF $f_Y$ w.r.t. the Lebesgue measure
$\mu^N$ which, using the method of transformation (e.g., [4, p. 244]),
can be computed as

$$f_Y(y) = \sum_{x_i \in g^{-1}[y]} \frac{f_X(x_i)}{|\det \mathcal{J}_g(x_i)|}. \tag{34}$$

In addition to that, since the preimage $g^{-1}[y]$ is countable
for all $y \in \mathcal{Y}$, it follows that $d(X|Y) = 0$. But with $d(X) = N$
from $P_X \ll \mu^N$ we can apply Proposition 3 to obtain
$l(X \to Y) = 0$. Thus, relative information loss will not tell
us much about the behavior of the system. In the following,
we therefore stick to Definition 1 and analyze the (absolute)
information loss in piecewise bijective functions (PBFs).

A. Information Loss in PBFs 
We present 

Proposition 7 (Information Loss and Differential Entropy).
The information loss induced by a PBF is given as

$$L(X \to Y) = h(X) - h(Y) + E\{\log |\det \mathcal{J}_g(X)|\} \tag{35}$$

where the expectation is taken w.r.t. $X$.

Proof: See Appendix B. ■
Aside from being one of the main results of this work, it
also complements a result presented in [4, p. 660]. There, it
was claimed that

$$h(Y) \leq h(X) + E\{\log |\det \mathcal{J}_g(X)|\} \tag{36}$$

where equality holds if and only if $g$ is bijective, i.e., a lossless
system. This inequality results from

$$f_Y(g(x)) \geq \frac{f_X(x)}{|\det \mathcal{J}_g(x)|} \tag{37}$$

with equality if and only if $g$ is invertible at $x$. Proposition 7
essentially states that the difference between the right-hand
side and the left-hand side of (36) is the information lost due
to data processing.

Example 3 (Square-Law Device and Gaussian Input). We
illustrate this result by assuming that $X$ is a zero-mean, unit
variance Gaussian RV and that $Y = X^2$. We switch in this
example to measuring entropy in nats, so that we can compute
the differential entropy of $X$ as $h(X) = \frac{1}{2} \ln(2\pi e)$. The output
$Y$ is a $\chi^2$-distributed RV with one degree of freedom, for
which the differential entropy can be computed as [35]

$$h(Y) = \frac{1}{2} \left( 1 + \ln \pi - \gamma \right) \tag{38}$$

where $\gamma$ is the Euler-Mascheroni constant [36, p. 3]. The
Jacobian determinant degenerates to the derivative, and using
some calculus we obtain

$$E\{\ln |g'(X)|\} = E\{\ln |2X|\} = \frac{1}{2} \left( \ln 2 - \gamma \right). \tag{39}$$

Applying Proposition 7 we obtain an information loss of
$L(X \to Y) = \ln 2$, which after changing the base of the
logarithm amounts to one bit. Indeed, the information loss
induced by a square-law device is always one bit if the PDF
of the input RV has even symmetry [37].
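
The arithmetic of this example can be reproduced numerically. The Python sketch below plugs the closed-form entropies into (35) and estimates $E\{\ln|2X|\}$ with a plain midpoint rule; the integration limits, the number of cells, and the function name are assumptions of this sketch, and the Euler-Mascheroni constant is hard-coded because the standard `math` module does not provide it.

```python
from math import e, exp, log, pi, sqrt

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_log_abs_derivative(num_cells=40000, limit=10.0):
    """Midpoint-rule estimate of E{ln|2X|} for X ~ N(0, 1), in nats.
    An even number of cells keeps the midpoints away from the singularity at 0."""
    step = 2.0 * limit / num_cells
    total = 0.0
    for k in range(num_cells):
        x = -limit + (k + 0.5) * step
        total += log(abs(2.0 * x)) * exp(-0.5 * x * x) / sqrt(2.0 * pi) * step
    return total


h_x = 0.5 * log(2.0 * pi * e)                  # differential entropy of X
h_y = 0.5 * (1.0 + log(pi) - EULER_GAMMA)      # differential entropy of Y, cf. (38)
jacobian_term = expected_log_abs_derivative()  # numerical counterpart of (39)

loss_nats = h_x - h_y + jacobian_term          # Proposition 7
loss_bits = loss_nats / log(2.0)
```

With these settings the numerical estimate of the Jacobian term is close to $\frac{1}{2}(\ln 2 - \gamma)$ and the resulting loss is close to one bit; the small residual stems from the midpoint rule near the logarithmic singularity.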

We note in passing that the statement of Proposition 7
has a tight connection to the theory of iterated function
systems. In particular, [16] analyzed the information flow in
one-dimensional maps, which is the difference between information
generation via stretching (corresponding to the term
involving the Jacobian determinant) and information reduction
via folding (corresponding to information loss). Ruelle [38]
later proved that for a restricted class of systems the folding
entropy ($L(X \to Y)$) cannot fall below the information
generated via stretching, and therefore speaks of positivity
of entropy production. He also established a connection to
the Kolmogorov-Sinai entropy rate. In [39] both components
constituting information flow in iterated function systems are
described as ways a dynamical system can lose information.
Since the connection between the theory of iterated function 
maps, Kolmogorov-Sinai entropy rate, and information loss 
deserves undivided attention, we leave a more thorough anal- 
ysis thereof for a later time. 

We turn to an explanation of why the information loss
actually occurs. Intuitively, the information loss is due to the
non-injectivity of $g$, i.e., employing Definition 5, due to the
fact that the bijectivity of $g$ is only piecewise. We will make
this precise after introducing

Definition 6 (Partition Indicator). Let $W$ be a discrete RV
which is defined as

$$W = i \quad \text{if } X \in \mathcal{X}_i \tag{40}$$

for all $i$.



Proposition 8. The information loss is identical to the uncertainty
about the set $\mathcal{X}_i$ from which the input was taken,
i.e.,

$$L(X \to Y) = H(W|Y). \tag{41}$$

Proof: See Appendix C. ■
This proposition states that the information loss of a PBF
stems from the fact that by observing the output one has
remaining uncertainty about which element of the partition
$\{\mathcal{X}_i\}$ contained the input value. Moreover, by looking at
Definition 6 one can see that $W$ is obtained by quantizing
$X$ with partition $\{\mathcal{X}_i\}$. Consequently, the limit in Definition 1
is actually achieved at a comparably coarse partition.
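
Proposition 8 lends itself to a direct numerical illustration. The Python sketch below estimates $H(W|Y)$ for a square-law device $Y = X^2$ with a unit-variance Gaussian input of adjustable mean; the partition into negative and non-negative inputs, the midpoint rule, and the function name are assumptions of this sketch. For the square-law device both preimages of $y$ share the same $|g'(x)|$, so the posterior probability of the branch containing $x$ reduces to $f_X(x) / (f_X(x) + f_X(-x))$.

```python
from math import exp, log2, pi, sqrt


def partition_uncertainty(mean, num_cells=40000, limit=12.0):
    """Midpoint-rule estimate of H(W|Y) in bits for Y = X^2 and X ~ N(mean, 1),
    where W indicates whether X is negative or non-negative."""
    pdf = lambda x: exp(-0.5 * (x - mean) ** 2) / sqrt(2.0 * pi)
    step = 2.0 * limit / num_cells
    total = 0.0
    for k in range(num_cells):
        x = -limit + (k + 0.5) * step
        f_x, f_mirror = pdf(x), pdf(-x)
        if f_x > 0.0:
            posterior = f_x / (f_x + f_mirror)  # Pr(W = branch of x | Y = x^2)
            total -= f_x * log2(posterior) * step
    return total


loss_symmetric = partition_uncertainty(0.0)  # even-symmetric input PDF
loss_shifted = partition_uncertainty(1.0)    # asymmetric input PDF
```

For the zero-mean input the estimate is one bit, in agreement with Example 3. Shifting the mean makes one branch more likely than the other, and the estimated loss drops below one bit.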
The result permits a simple, but interesting 



Corollary 2. $Y$ and $W$ together determine $X$, i.e.,

$$H(X|Y, W) = 0. \tag{42}$$

Proof: Since $W$ is obviously a function of $X$,

\begin{align}
L(X \to Y) = H(X|Y) &= H(X, W|Y) \\
&= H(X|W, Y) + H(W|Y) \\
&= H(X|W, Y) + L(X \to Y) \tag{43}
\end{align}

from which $H(X|Y, W) = 0$ follows. ■
In other words, knowing the output value, and the element 
of the partition from which the input originated, perfect recon- 
struction is possible. We will make use of this in Section Ini-CI 
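As a numerical illustration of Proposition 8, the following minimal sketch evaluates $H(W \mid Y)$ for a square-law device $Y = X^2$ with a Gaussian input. The partition into negative and non-negative inputs, the Gaussian parameters, the base-2 logarithm and the midpoint integration rule are assumptions made for this illustration and are not taken from the text above.

```python
import math

def square_law_loss(mean, std, steps=20_000):
    """Approximate L(X -> Y) = H(W|Y) in bits for Y = X**2, X ~ N(mean, std**2)."""
    lo, hi = mean - 12.0 * std, mean + 12.0 * std
    dx = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * dx  # midpoint rule
        pdf = math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
        # Both preimages +x and -x have the same |g'|, hence
        # Pr(W = sign(x) | Y = x**2) = f_X(x) / (f_X(x) + f_X(-x)).
        t = -2.0 * mean * x / std ** 2  # log of f_X(-x) / f_X(x)
        # -log2 Pr(W | Y) = log2(1 + exp(t)), evaluated in a numerically stable way
        surprisal = (max(t, 0.0) + math.log1p(math.exp(-abs(t)))) / math.log(2.0)
        total += pdf * surprisal * dx
    return total

for mean in (0.0, 1.0, 3.0):
    print(mean, square_law_loss(mean, 1.0))
```

For a zero-mean input the sign of $X$ is completely hidden by the output, so the loss is one bit; shifting the mean away from zero makes one preimage more likely and the loss shrinks.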
Before proceeding, we present an example where an infinite 
amount of information is lost in a PBF: 

Example 4 (Infinite Loss).

Assume that we consider the following scalar function
$g\colon (0,1] \to (0,1]$, mapping every interval $(2^{-n}, 2^{-n+1}]$ onto
the interval $(0,1]$:

$$g(x) = 2^n (x - 2^{-n}) \quad \text{if } x \in (2^{-n}, 2^{-n+1}],\ n \in \mathbb{N}. \tag{44}$$

Assume further that the PDF of the input $X$ is given as (see
Fig. 4)

$$f_X(x) = 2^n \left( \frac{1}{\log(n+1)} - \frac{1}{\log(n+2)} \right) \quad \text{if } x \in (2^{-n}, 2^{-n+1}],\ n \in \mathbb{N}. \tag{45}$$

Fig. 4. Piecewise bijective function and input density leading to infinite loss
(cf. Example 4).

As an immediate consequence, the output RV $Y$ is uniformly
distributed on $(0, 1]$.

To apply Proposition 8 we need

$$\Pr(W = n \mid Y = y) = \Pr(W = n) = \frac{1}{\log(n+1)} - \frac{1}{\log(n+2)}. \tag{46}$$

For this distribution, the entropy is known to be infinite [40],
and thus

$$L(X \to Y) = H(W \mid Y) = H(W) = \infty. \tag{47}$$

B. Upper Bounds on the Information Loss

The examples we examined so far were simple in the sense
that the information loss could be computed in closed form.
There are certainly cases where this is not possible, especially
since the expressions involved in Proposition 7 may involve
a logarithm of a sum. It is therefore essential to accompany
the exact expressions by bounds which are simpler to
evaluate. In particular, we present a corollary to Proposition 8
which follows from the fact that conditioning reduces entropy:

Corollary 3.

$$L(X \to Y) \le H(W) \tag{48}$$

Note that in both Example 3 and Example 4 this upper bound holds
with equality.
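The distribution of the partition indicator in Example 4 can be explored numerically. The sketch below assumes base-2 logarithms (consistent with measuring the loss in bits) and merely tabulates partial sums: the probability mass telescopes towards one, while the partial entropy keeps growing, albeit very slowly. The truncation points are arbitrary choices for illustration.

```python
import math

def p(n):
    """Pr(W = n) from Example 4, with logarithms taken to base 2 (assumption)."""
    return 1.0 / math.log2(n + 1) - 1.0 / math.log2(n + 2)

def partial_mass_and_entropy(n_max):
    """Probability mass and entropy contribution (bits) of the first n_max atoms."""
    mass, entropy = 0.0, 0.0
    for n in range(1, n_max + 1):
        pn = p(n)
        mass += pn
        entropy -= pn * math.log2(pn)
    return mass, entropy

for n_max in (10, 1_000, 100_000):
    mass, entropy = partial_mass_and_entropy(n_max)
    print(n_max, round(mass, 4), round(entropy, 4))
```

A finite table of this kind cannot prove divergence; it only shows that the truncated entropy does not settle while a noticeable part of the probability mass is still missing.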



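The proof of Proposition 8 also allows us to derive the
bounds presented in

Proposition 9 (Upper Bounds on Information Loss). The
information loss induced by a PBF can be upper bounded
by the following ordered set of inequalities:

\begin{align}
L(X \to Y) &\le \int_{\mathcal{Y}} f_Y(y) \log \mathrm{card}(g^{-1}[y])\, dy \tag{49} \\
&\le \log \left( \sum_i \int_{\mathcal{Y}_i} f_Y(y)\, dy \right) \tag{50} \\
&\le \operatorname*{ess\,sup}_{y \in \mathcal{Y}} \log \mathrm{card}(g^{-1}[y]) \tag{51} \\
&\le \log \mathrm{card}(\{\mathcal{X}_i\}) \tag{52}
\end{align}
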
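where $\mathrm{card}(B)$ is the cardinality of the set $B$. Bound (49)
holds with equality if and only if

$$\sum_{x_k \in g^{-1}[g(x)]} \frac{f_X(x_k)}{|\det \mathcal{J}_g(x_k)|} \cdot \frac{|\det \mathcal{J}_g(x)|}{f_X(x)} = \mathrm{card}(g^{-1}[g(x)]) \quad P_X\text{-a.s.} \tag{53}$$

If and only if this expression is constant $P_X$-a.s., bounds (50)
and (51) are tight. Bound (52) holds with equality if and only
if additionally $P_Y(\mathcal{Y}_i) = 1$ for all $i$.

Proof: See Appendix D. ■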
Note that all bounds of Proposition 9 hold with equality in
Examples 3 and 4. Clearly, examples where the PDF of $X$ and
the absolute value of the Jacobian determinant are constant
on $\mathcal{X}$ render the first bound (49) tight (cf. BT1 conference
version, Sect. VI]). Two other types of scenarios, where these
bounds can hold with equality, are worth mentioning: First, for
functions $g\colon \mathbb{R} \to \mathbb{R}$ equality holds if the function is related to
the cumulative distribution function of the input RV such that,
for all $x$, $|g'(x)| = f_X(x)$ (see extended version of [37]).
The second case occurs when both function and PDF are
"repetitive", in the sense that their behavior on $\mathcal{X}_1$ is copied to
all other $\mathcal{X}_i$ and that, thus, $f_X(x_i)$ and $|\det \mathcal{J}_g(x_i)|$ are the same
for all elements of the preimage $g^{-1}[y]$. Example 3 represents
such a case.

C. Reconstruction and Reconstruction Error Probability 

We now investigate connections between the information 
lost in a system and the probability for correctly reconstructing 
the system input. In particular, we present a series of Fano-type 
inequalities between the information loss and the reconstruc- 
tion error probability. This connection is sensible, since the 
preimage of every output value is an at most countable set. 

Intuitively, one would expect that the fidelity of a recon- 
struction of a continuous input RV is best measured by some 
distance measure "natural" to the set X, such as, e.g., the 
mean absolute distance or the mean squared-error (MSE), 
if X is a subset of the Euclidean space. However, as the 
following example shows, there is no connection between the 
information loss and such distance measures: 

Example 5 (Energy and Information behave differently). 
Consider the two functions $g_1$ and $g_2$ depicted in Fig. 5,
together with a possible reconstructor (see below). Assume
further that the input RV is uniformly distributed on $[-a, a]$.
It follows that

$$L(X \to g_1(X)) = L(X \to g_2(X)) = 1. \tag{54}$$

Fig. 5. Two different functions $g_1$ and $g_2$ with the same information loss,
but with a different mean-squared reconstruction error. The corresponding
reconstructors are indicated by thick red lines. (It is immaterial whether the
functions are left- or right-continuous.)

The mean-squared reconstruction error $E\{(X - r(Y))^2\}$,
however, differs for $Y = g_1(X)$ and $Y = g_2(X)$, since for
$g_1$ the Euclidean distance between $x$ and $r(y)$ is generally
smaller. In particular, decreasing the value of qi even further 
and extending the function accordingly would allow us to 
make the mean-squared reconstruction error arbitrarily small, 
while the information loss remains unchanged. 

Clearly, a reconstructor trying to minimize the mean- 
squared reconstruction error will look totally different than 
a reconstructor trying to recover the input signal with high 
probability. Since for piecewise bijective functions a recovery 
of X is possible (in contrast to noisy systems, where this 
is not the case), in our opinion a thorough analysis of such 
reconstructors is in order. 

Aside from being of theoretical interest, there are practical
reasons to justify the investigation: As already mentioned in
Section III-B, the information loss is a quantity which is not
always computable in closed form. If one can thus define a
(sub-optimal) reconstruction of the input from the output for
which the probability $P_e$ of error is easy to calculate, the
Fano-type bounds would yield yet another set of upper bounds
on the information loss. But also the reverse direction is of
practical interest: Given the information loss $L(X \to Y)$
of a system, the presented inequalities allow one to bound
the reconstruction error probability $P_e$. For example, one might want to
obtain performance bounds of a non-coherent communications
receiver (e.g., energy detector) in a semi-coherent broadcast
scenario (e.g., combining pulse-position modulation and
phase-shift keying, as in the IEEE 802.15.4a standard [42]).

We therefore present Fano-type inequalities to bound the
reconstruction error probability via information loss. Due to
the peculiarities of entropy pointed out in [43], however, we
restrict ourselves to finite partitions $\{\mathcal{X}_i\}$, guaranteeing a finite
preimage for every output value. We note in passing that
the results derived in this subsection not only apply to the
mitigation of non-injective effects of deterministic systems,
but to any reconstruction scenario where the cardinality of the
input alphabet depends on the actual output value.

We start with introducing

Definition 7 (Reconstructor & Reconstruction Error). Let
$r\colon \mathcal{Y} \to \mathcal{X}$ be a reconstructor. Let $E$ denote the event of
a reconstruction error, i.e.,

$$E = \begin{cases} 1, & \text{if } r(Y) \neq X \\ 0, & \text{if } r(Y) = X. \end{cases} \tag{55}$$

The probability of a reconstruction error is given by

$$P_e = \Pr(E = 1) = \int_{\mathcal{Y}} P_e(y)\, dP_Y(y) \tag{56}$$

where $P_e(y) = \Pr(E = 1 \mid Y = y)$.

In the following, we will investigate two different types of
reconstructors: the maximum a-posteriori (MAP) reconstructor
and a sub-optimal reconstructor. The MAP reconstructor
chooses the reconstruction such that its conditional probability
given the output is maximized, i.e.,

$$r_{\mathrm{MAP}}(y) = \arg\max_{x_k \in g^{-1}[y]} \Pr(X = x_k \mid Y = y). \tag{57}$$

In other words, with Definition 7, the MAP reconstructor
minimizes $P_e(y)$. Interestingly, this reconstructor has a simple
description for the problem at hand:

Proposition 10 (MAP Reconstructor). The MAP estimator for
a PBF is

$$r_{\mathrm{MAP}}(y) = g_k^{-1}(y) \tag{58}$$

where

$$k = \arg\max_{i:\, g_i^{-1}(y) \neq \emptyset} \left\{ \frac{f_X(g_i^{-1}(y))}{|\det \mathcal{J}_g(g_i^{-1}(y))|} \right\}. \tag{59}$$

Proof: The proof follows from Corollary 2, which states
that, given $Y = y$ is known, reconstructing the input essentially
amounts to reconstructing the partition from which it
was chosen. The MAP reconstructor thus can be rewritten as

$$r_{\mathrm{MAP}}(y) = \arg\max_i\, p(i \mid y) \tag{60}$$

where we used the notation from the proof of Proposition 8.
From there,

$$p(i \mid y) = \begin{cases} \dfrac{f_X(g_i^{-1}(y))}{|\det \mathcal{J}_g(g_i^{-1}(y))|\, f_Y(y)}, & \text{if } g_i^{-1}(y) \neq \emptyset \\ 0, & \text{if } g_i^{-1}(y) = \emptyset. \end{cases} \tag{61}$$

This completes the proof. ■
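The selection rule of Proposition 10 can be sketched in a few lines for scalar functions. The helper name `map_reconstruct`, the representation of the inverse branches as callables returning `None` outside their image, and the square-law example with a Gaussian input are hypothetical choices made only for this illustration.

```python
import math

def map_reconstruct(y, inverse_branches, pdf, abs_jacobian):
    """Return the preimage of y that maximizes f_X(x) / |g'(x)|, or None if y has no preimage.

    inverse_branches: callables g_i^{-1}; each returns None where y lies
    outside the image of the i-th partition element.
    """
    best_x, best_score = None, -1.0
    for inverse in inverse_branches:
        x = inverse(y)
        if x is None:
            continue
        score = pdf(x) / abs_jacobian(x)
        if score > best_score:
            best_x, best_score = x, score
    return best_x

def gaussian_pdf(x, mean=1.0, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical example: g(x) = x**2 with branches on x < 0 and x > 0.
branches = [
    lambda y: -math.sqrt(y) if y > 0 else None,
    lambda y: math.sqrt(y) if y > 0 else None,
]
abs_jacobian = lambda x: abs(2.0 * x)

print(map_reconstruct(4.0, branches, gaussian_pdf, abs_jacobian))
```

Since both preimages of the square-law device share the same derivative magnitude, the rule reduces to picking the preimage with the larger input density, i.e., the one on the side of the input mean.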
We derive Fano-type bounds for the MAP reconstructor, or
any reconstructor for which $r(y) \in g^{-1}[y]$. Under the assumption
of a finite partition $\{\mathcal{X}_i\}$, note that Fano's inequality [44,
pp. 39], where $H_2(p) = -p \log p - (1-p) \log(1-p)$,

$$L(X \to Y) \le H_2(P_e) + P_e \log\left(\mathrm{card}(\{\mathcal{X}_i\}) - 1\right) \tag{62}$$

trivially holds. We further note that in the equation above
one can exchange $\mathrm{card}(\{\mathcal{X}_i\})$ by $\operatorname{ess\,sup}_{y \in \mathcal{Y}} \mathrm{card}(g^{-1}[y])$
to improve the bound. In what follows, we aim at further
improvements.

Definition 8 (Bijective Part). Let $\mathcal{X}_b$ be the maximal set such
that $g$ restricted to this set is injective, and let $\mathcal{Y}_b$ be the image
of this set. Thus, $g\colon \mathcal{X}_b \to \mathcal{Y}_b$ bijectively, where

$$\mathcal{X}_b = \{x \in \mathcal{X} : \mathrm{card}(g^{-1}[g(x)]) = 1\}. \tag{63}$$

Then $P_b = P_X(\mathcal{X}_b) = P_Y(\mathcal{Y}_b)$ denotes the bijectively mapped
probability mass.



Proposition 11 (Fano-Type Bound). For the MAP reconstructor
- or any reconstructor for which $r(y) \in g^{-1}[y]$ - the
information loss $L(X \to Y)$ is upper bounded by

$$L(X \to Y) \le \min\{1 - P_b, H_2(P_e)\} - P_e \log P_e + P_e \log\left(E\{\mathrm{card}(g^{-1}[Y]) - 1\}\right). \tag{64}$$

Proof: See Appendix E. ■
If we compare this result with Fano's original bound (62),
we see that the cardinality of the partition is replaced by the
expected cardinality of the preimage. Due to the additional
term $P_e \log P_e$ this improvement is only potential, since there
exist cases where Fano's original bound is better. An example
is the square-law device of Example 3, for which Fano's
inequality is tight, but for which Proposition 11 would yield
$L(X \to Y) \le 2$.

For completeness, we want to mention that for the MAP 
reconstructor also a lower bound on the information loss can 
be given. We restate 

Proposition 12 (Feder & Merhav, [45]). The information loss
$L(X \to Y)$ is lower bounded by the error probability $P_e$ of
a MAP reconstructor by

$$\phi(P_e) \le L(X \to Y) \tag{65}$$

where $\phi(x)$ is a piecewise linear function defined as

$$\phi(x) = \left(x - \frac{i-1}{i}\right)(i+1)\, i \log\left(1 + \frac{1}{i}\right) + \log i \tag{66}$$

for $\frac{i-1}{i} \le x \le \frac{i}{i+1}$.

At the time of submission we were not able to improve
this bound for the present context since the cardinality of the
preimage has no influence on $\phi$.
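The piecewise linear function of Proposition 12 is easy to evaluate once the active segment index $i$ is known. The sketch below assumes base-2 logarithms and determines $i$ from $x$ via $i = \lfloor 1/(1-x) \rfloor$; the function name is a hypothetical choice.

```python
import math

def feder_merhav_phi(x):
    """Evaluate the piecewise linear function phi(x) in bits for 0 <= x < 1."""
    if not 0.0 <= x < 1.0:
        raise ValueError("x must lie in [0, 1)")
    # Segment i satisfies (i - 1) / i <= x <= i / (i + 1).
    i = max(1, math.floor(1.0 / (1.0 - x)))
    return (x - (i - 1) / i) * (i + 1) * i * math.log2(1.0 + 1.0 / i) + math.log2(i)

for x in (0.0, 0.25, 0.5, 2.0 / 3.0, 0.9):
    print(x, feder_merhav_phi(x))
```

At the segment boundaries $x = (i-1)/i$ the function takes the value $\log i$, so adjacent segments join continuously and a rounding error in the choice of $i$ at a boundary does not change the result.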

We illustrate the utility of these bounds, together with those
presented in Proposition 9, in

Example 6 (Third-order Polynomial). Consider the function
depicted in Fig. 6, which is defined as

$$g(x) = x^3 - 100x. \tag{67}$$

The input to this function is a zero-mean Gaussian RV $X$ with
variance $\sigma^2$. A closed-form evaluation of the information loss
is not possible, since the integral involves the logarithm of a
sum. However, we note that

$$\mathcal{X}_b = \left(-\infty, -\frac{20}{\sqrt{3}}\right) \cup \left(\frac{20}{\sqrt{3}}, \infty\right) \tag{68}$$

and thus $P_b = 2Q\left(\frac{20}{\sqrt{3}\sigma}\right)$, where $Q$ denotes the Q-function
[36, 26.2.3]. With a little algebra we thus obtain the
bounds from Proposition 9 as

$$L(X \to Y) \le (1 - P_b) \log 3 \le \log(3 - 2P_b) \le \log 3 \tag{69}$$

where $\operatorname{ess\,sup}_{y \in \mathcal{Y}} \mathrm{card}(g^{-1}[y]) = \mathrm{card}(\{\mathcal{X}_i\}) = 3$.

Fig. 6. Third-order polynomial of Example 6 and its MAP reconstructor
indicated with a thick red line.

Fig. 7. Information loss for Example 6 as a function of input variance $\sigma^2$.

As can be shown (see footnote 6), the MAP reconstructor assumes the
properties depicted in Fig. 6, from which an error probability

$$P_e = 2Q\left(\frac{10}{\sqrt{3}\sigma}\right) - 2Q\left(\frac{20}{\sqrt{3}\sigma}\right) \tag{70}$$

can be computed. We display Fano's bound together with the
bounds from Propositions 9, 11, and 12 in Fig. 7.

What becomes apparent from this example is that the
bounds from Propositions 9 and 11 cannot form an ordered set;
the same holds for Fano's inequality, which can be better or
worse than our Fano-type bound, depending on the scenario.
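The bounds discussed for Example 6 can be tabulated numerically. The sketch below assumes base-2 logarithms and reads the bijectively mapped mass as $P_b = 2Q(20/(\sqrt{3}\sigma))$ and the MAP error probability as $P_e = 2Q(10/(\sqrt{3}\sigma)) - 2Q(20/(\sqrt{3}\sigma))$, which follow from the turning points $\pm 10/\sqrt{3}$ of $g(x) = x^3 - 100x$; the function names and the dictionary layout are hypothetical.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = Pr(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def example6_bounds(sigma):
    """Upper bounds on L(X -> Y) in bits for g(x) = x**3 - 100x, X ~ N(0, sigma**2)."""
    p_b = 2.0 * q_function(20.0 / (math.sqrt(3.0) * sigma))        # bijectively mapped mass
    p_e = 2.0 * q_function(10.0 / (math.sqrt(3.0) * sigma)) - p_b  # MAP error probability
    prop9 = ((1.0 - p_b) * math.log2(3.0),                         # bound (49)
             math.log2(3.0 - 2.0 * p_b),                           # bound (50)
             math.log2(3.0))                                       # bounds (51), (52)
    fano = binary_entropy(p_e) + p_e * math.log2(3 - 1)            # Fano with card = 3
    expected_card = 3.0 - 2.0 * p_b                                # E{card(g^{-1}[Y])}
    prop11 = min(1.0 - p_b, binary_entropy(p_e))
    if p_e > 0.0:
        prop11 += -p_e * math.log2(p_e) + p_e * math.log2(expected_card - 1.0)
    return {"p_b": p_b, "p_e": p_e, "prop9": prop9, "fano": fano, "prop11": prop11}

for sigma in (5.0, 10.0, 20.0, 40.0):
    print(sigma, example6_bounds(sigma))
```

Printing the table for several values of $\sigma$ makes it easy to check which of the bounds is the smallest for a given input variance.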

6 The authors thank Stefan Wakolbinger for pointing us to this fact.

While in this example the MAP reconstructor was relatively
simple to find, this might not always be the case. For bounding
the information loss of a system (rather than reconstructing
the input), it is therefore desirable to introduce a simpler,
sub-optimal reconstructor:

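Proposition 13 (Suboptimal Reconstruction). Consider the
following sub-optimal reconstructor

$$r_{\mathrm{sub}}(y) = \begin{cases} g^{-1}(y), & \text{if } y \in \mathcal{Y}_b \\ g_k^{-1}(y), & \text{if } y \in \mathcal{Y}_k \setminus \mathcal{Y}_b \\ x\colon x \in \mathcal{X}_k, & \text{else} \end{cases} \tag{71}$$

where

$$k = \arg\max_i P_X(\mathcal{X}_i \cup \mathcal{X}_b) \tag{72}$$

and where $\mathcal{Y}_k = g(\mathcal{X}_k)$.

Letting $K = \operatorname{ess\,sup}_{y \in \mathcal{Y}} \mathrm{card}(g^{-1}[y])$ and with the error
probability

$$\hat{P}_e = 1 - P_X(\mathcal{X}_k \cup \mathcal{X}_b) \tag{73}$$

of this reconstructor, the information loss is upper bounded by
the following Fano-type inequality:

$$L(X \to Y) \le 1 - P_b + \hat{P}_e \log(K - 1). \tag{74}$$

Proof: See Appendix F. ■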
This reconstructor is simple in the sense that the reconstruction
is always chosen from the element $\mathcal{X}_k$ containing most
of the probability mass, after considering the set on which the
function is bijective. This allows for a simple evaluation of
the reconstruction error probability $\hat{P}_e$, which is independent
of the Jacobian determinant of $g$.

It is interesting to see that the Fano-type bound derived
here permits a similar expression as derived in Proposition 11,
despite the fact that the sub-optimal reconstructor does not
necessarily satisfy $r_{\mathrm{sub}}(y) \in g^{-1}[y]$. For this type of reconstructors,
$(\mathrm{card}(\cdot) - 1)$ typically has to be replaced by $\mathrm{card}(\cdot)$. We thus
note that also the following bounds hold:

\begin{align}
L(X \to Y) &\le H_2(\hat{P}_e) + \hat{P}_e \log\Big( \operatorname*{ess\,sup}_{y \in \mathcal{Y}} \mathrm{card}(g^{-1}[y]) \Big) \tag{75} \\
&\le H_2(\hat{P}_e) + \hat{P}_e \log\left(\mathrm{card}(\{\mathcal{X}_i\})\right) \tag{76} \\
L(X \to Y) &\le \min\{1 - P_b, H_2(\hat{P}_e)\} - \hat{P}_e \log \hat{P}_e + \hat{P}_e \log\left(E\{\mathrm{card}(g^{-1}[Y])\}\right) \tag{77}
\end{align}

Before proceeding, we want to briefly reconsider

Example 4 (Infinite Loss (revisited)). For the PDF and the
function depicted in Fig. 4 it was shown that the information
loss was infinite. By recognizing that the probability mass
contained in $\mathcal{X}_1 = (\frac{1}{2}, 1]$ exceeds the mass contained in all
other subsets, we obtain an error probability for reconstruction
equal to

$$\hat{P}_e = \frac{1}{\log 3} \approx 0.63. \tag{78}$$

In this particular case we even have $\hat{P}_e = P_e$, since the MAP
reconstructor coincides with the suboptimal reconstructor.
Since in this case $\mathrm{card}(g^{-1}[y]) = \infty$ for all $y \in \mathcal{Y}$, all upper
bounds derived from Fano-type bounds evaluate to infinity.

IV. Information Loss for Functions which Reduce
Dimensionality

We now analyze systems for which the absolute information
loss $L(X \to Y)$ is infinite. Aside from practically irrelevant
cases as in Example 4, this subsumes cases where the
dimensionality of the input signal is reduced, e.g., by dropping
coordinates or by keeping the function constant on a subset of
its domain.

Throughout this section we assume that the input RV has
positive information dimension, i.e., $d(X) > 0$, and infinite
entropy $H(X) = \infty$. We further assume that the function $g$
describing the system is such that the relative information loss
$l(X \to Y)$ is positive (from which $L(X \to Y) = \infty$ follows;
cf. Proposition 6).

According to Proposition 4, the relative information loss in a
Lipschitz function is positive whenever the information dimension
of the input RV $X$ is reduced. Interestingly, a reduction
of the dimension of the support $\mathcal{X}$ does not necessarily lead to
a positive relative information loss, nor does its preservation
guarantee vanishing relative information loss.

A. Relative Information Loss for Continuous Input RVs

We again assume that $\mathcal{X} \subseteq \mathbb{R}^N$ and $P_X \ll \mu^N$, thus
$d(X) = N$. We already found in Proposition 4 that for
Lipschitz functions the relative information loss is given as

$$l(X \to Y) = 1 - \frac{d(Y)}{N} \tag{79}$$

where $Y$ may be a mixture of RVs with different information
dimensions (for which $d(Y)$ can be computed; cf. [26], [51]).
Such a mixture may result, e.g., from a function $g$ mapping
different subsets of $\mathcal{X}$ to sets of different covering dimension.
We intend to make this statement precise in what follows.

First, let us drop the requirement of Lipschitz continuity;
generally, we now cannot expect Proposition 4 to hold. We
assume that $g$ is piecewise defined, as in Definition 5. Here,
however, we do not require $g_i\colon \mathcal{X}_i \to \mathcal{Y}_i$ to be bijective, but
to be a submersion, i.e., a smooth function between smooth
manifolds whose pushforward is surjective everywhere (see,
e.g., [46]). A projection onto any $M < N$ coordinates of $X$,
for example, is a submersion. With these things in mind, we
present

Proposition 14 (Relative Information Loss in Dimensionality
Reduction). Let $\{\mathcal{X}_i\}$ be a partition of $\mathcal{X}$ such that each
of its $K$ elements is a smooth $N$-dimensional manifold. Let
$g$ be such that $g_i = g|_{\mathcal{X}_i}$ are submersions to smooth
$M_i$-dimensional manifolds ($M_i \le N$). Then, the relative
information loss is

$$l(X \to Y) = \sum_{i=1}^{K} P_X(\mathcal{X}_i) \frac{N - M_i}{N}. \tag{80}$$

Proof: See Appendix G. ■
This result shows that the statement of Proposition 4 not
only holds for Lipschitz functions $g$, but for a larger class of
systems yet to be identified.



We present two corollaries to Proposition 14 concerning
projections onto a subset of coordinates and functions which
are constant on some subset with positive $P_X$-measure.

Corollary 4. Let $g$ be any projection of $X$ onto $M$ of its
coordinates. Then, the relative information loss is

$$l(X \to Y) = \frac{N - M}{N}. \tag{81}$$

Corollary 5. Let $g$ be constant on a set $A \subseteq \mathcal{X}$ with positive
$P_X$-measure. Let furthermore $g$ be such that $\mathrm{card}(g^{-1}[y]) < \infty$
for all $y \notin g(A)$. Then, the relative information loss is

$$l(X \to Y) = P_X(A). \tag{82}$$

The first of these two corollaries has been applied to
principal component analysis in [47], while the second allows
us to take up Example 2 (center clipper) again: There, we
showed that both the information loss and the information
transfer are infinite. For the relative information loss we can
now show that it corresponds to the probability mass contained
in the clipping region, i.e., $l(X \to Y) = P_X([-c, c])$.

The somewhat surprising consequence of these results is
that the shape of the PDF has no influence on the relative
information loss; whether the PDF is peaky in the clipping
region or flat, or whether the omitted coordinates are highly
correlated to the preserved ones, neither increases nor
decreases the relative information loss.

In particular, in [47] we showed that dimensionality reduction
after performing a principal component analysis leads to
the same relative information loss as directly dropping $N - M$
coordinates of the input vector $X$.
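The formula of Proposition 14 and its two corollaries amount to a probability-weighted average of the lost dimensions. The following sketch evaluates it for a projection and for a center clipper; the helper name, the standard Gaussian input and the clipping threshold $c = 1$ are assumptions chosen only for illustration.

```python
import math

def relative_information_loss(pieces, n_dim):
    """Evaluate sum_i P_X(X_i) * (N - M_i) / N.

    pieces: iterable of (probability mass P_X(X_i), output dimension M_i).
    """
    pieces = list(pieces)
    total_mass = sum(mass for mass, _ in pieces)
    if abs(total_mass - 1.0) > 1e-9:
        raise ValueError("partition masses must sum to one")
    return sum(mass * (n_dim - m_dim) / n_dim for mass, m_dim in pieces)

# Corollary 4: projection of a 3-dimensional input onto one coordinate.
projection = relative_information_loss([(1.0, 1)], n_dim=3)

# Corollary 5: center clipper with a standard Gaussian input (assumed) and c = 1.
c = 1.0
mass_clipped = math.erf(c / math.sqrt(2.0))  # P_X([-c, c])
clipper = relative_information_loss([(mass_clipped, 0), (1.0 - mass_clipped, 1)], n_dim=1)

print(projection, clipper)
```

Only the mass of each partition element enters the computation, which reflects the observation above that the shape of the PDF inside the clipping region is irrelevant.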

B. Bounds on the Relative Information Loss

Complementing the results from Section III-B, we now
present bounds on the relative information loss for some
particular cases. We note in passing that from the trivial
bounds on the information dimension ($d(X) \in [0, N]$ if
$\mathcal{X}$ is a subset of the $N$-dimensional Euclidean space or a
sufficiently smooth $N$-dimensional manifold) simple bounds
on the relative information loss can be computed.

Here we present bounds on the relative information transfer
and the relative information loss for an $N$-dimensional input
RV by the corresponding coordinate-wise quantities.

Proposition 15 (Upper Bound on the Relative Information
Transfer). Let $g$ be a Lipschitz function with $N$-dimensional
input $X$ and $K$-dimensional output $Y$. The relative information
transfer is bounded by

$$t(X \to Y) \le \sum_{i=1}^{K} t(X \to Y^{(i)}) \tag{83}$$

where $Y^{(i)}$ is the $i$-th coordinate of $Y$.

Proof: See Appendix H. ■
This upper bound on the relative information transfer (which
leads to a lower bound on the relative information loss) can
also be applied if the system has $K$ one-dimensional output
RVs, in which case $Y$ denotes their collection.

Proposition 16 (Upper Bound on the Relative Information
Loss). Let $X$ be an $N$-dimensional RV with a probability
measure $P_X \ll \mu^N$ and let $Y$ be $N$-dimensional. Then,

$$l(X \to Y) \le \frac{1}{N} \sum_{i=1}^{N} l(X^{(i)} \to Y) \le \frac{1}{N} \sum_{i=1}^{N} l(X^{(i)} \to Y^{(i)}) \tag{84}$$

where $X^{(i)}$ and $Y^{(i)}$ are the $i$-th coordinates of $X$ and $Y$,
respectively.

Proof: See Appendix I. ■

Example 7 (Projection). Let $X$ be an $N$-dimensional RV with
probability measure $P_X \ll \mu^N$ and let $X^{(i)}$ denote the $i$-th
coordinate of $X$. Let $g$ be a projection onto the first $M < N$
coordinates. The relative information loss is given as

$$l(X \to Y) = \frac{N - M}{N} \tag{85}$$

by Corollary 4. Note further that $t(X \to Y^{(i)}) = \frac{1}{N}$ for all
$i \in \{1, \ldots, M\}$, which renders the bound of Proposition 15
tight. Furthermore, $l(X^{(i)} \to Y) = 0$ for $i \in \{1, \ldots, M\}$,
while $l(X^{(i)} \to Y) = 1$ for $i \in \{M+1, \ldots, N\}$, which shows
tightness of Proposition 16 as well.
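The tightness claims of Example 7 can be verified with exact rational arithmetic. The sketch assumes that the relative information transfer and the relative information loss are complementary, $t(X \to Y) = 1 - l(X \to Y)$, and uses the coordinate-wise values stated in the example; the function name is hypothetical.

```python
from fractions import Fraction

def projection_check(n_dim, m_dim):
    """Return (loss, transfer, coordinate-wise transfer bound, coordinate-wise loss bound)."""
    loss = Fraction(n_dim - m_dim, n_dim)  # relative loss of the projection
    transfer = 1 - loss                    # assumed relation t = 1 - l
    # Each of the M preserved output coordinates transfers 1/N.
    transfer_bound = sum(Fraction(1, n_dim) for _ in range(m_dim))
    # Coordinate-wise losses: 0 for preserved, 1 for dropped input coordinates.
    per_coordinate = [0] * m_dim + [1] * (n_dim - m_dim)
    loss_bound = Fraction(sum(per_coordinate), n_dim)
    return loss, transfer, transfer_bound, loss_bound

print(projection_check(5, 2))
```

Both coordinate-wise bounds coincide with the exact values for any choice of $N$ and $M$, which is what tightness means in this example.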



C. Reconstruction and Reconstruction Error Probability

We next take up the approach of Section III-C and present
Fano-type relations between the relative information loss and
the probability of a reconstruction error. While for piecewise
bijective functions this relation was justified by the fact that for
every output value the preimage under the system function is a
countable set, the case is completely different here: Quantizers,
for example, characterized with relative information loss in our
framework, are typically evaluated based on some energetic
measures (e.g., the mean-squared reconstruction error). As the
following example shows, the relative information loss does
not permit a meaningful interpretation in energetic terms, again
underlining the intrinsically different behavior of information
and energy measures.

Example 1 (Quantizer (revisited)). We now consider a continuous
one-dimensional RV $X$ ($P_X \ll \mu$) and the quantizer
introduced in Section II-A. Since the quantizer is constant
$P_X$-a.s., we obtain with Corollary 5

$$l(X \to \hat{X}_n) = 1. \tag{86}$$

In other words, the quantizer destroys 100% of the information
available at its input. This naturally holds for all $n$, so a finer
partition $\mathcal{P}_n$ cannot decrease the relative information loss.
Conversely, the mean-squared reconstruction error decreases
with increasing $n$.

We therefore turn to find connections between relative
information loss and the reconstruction error probability after
introducing

Definition 9 (Minkowski Dimension). The Minkowski- or
box-counting dimension of a compact set $\mathcal{X} \subset \mathbb{R}^N$ is

$$d_B(\mathcal{X}) = \lim_{n \to \infty} \frac{\log \mathrm{card}(\mathcal{P}_n)}{n} \tag{87}$$

where the partition $\mathcal{P}_n$ is induced by a uniform vector quantizer
with quantization interval $2^{-n}$.

Fig. 8. The accessible region for a $(P_e, l(X \to Y))$-pair for $d(X) =
0.6\, d_B(\mathcal{X})$. Points with numbers indicate references to the numbered
examples in this work. Note that the center clipper (Example 2) occurs three times
in the plot, representing each instance in the text.

The Minkowski dimension of a set equals the information
dimension of a uniform distribution on that set (e.g., [48]),
and is a special case of Renyi information dimension where the
entropy is replaced with the Renyi entropy of zeroth order [49].
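To make the box-counting idea of Definition 9 concrete, the sketch below counts the dyadic boxes of width $2^{-n}$ that are hit by the middle-third Cantor set, whose Minkowski dimension is known to be $\log 2 / \log 3 \approx 0.63$. The Cantor set, the finite construction level and the value $n = 12$ are assumptions chosen for illustration; at finite $n$ the ratio $\log_2 \mathrm{card}(\mathcal{P}_n)/n$ only roughly approximates the limit.

```python
import math

def cantor_intervals(level):
    """Left endpoints and common length of the level-`level` Cantor construction intervals."""
    lefts, length = [0.0], 1.0
    for _ in range(level):
        length /= 3.0
        lefts = [a for left in lefts for a in (left, left + 2.0 * length)]
    return lefts, length

def box_count(n, level):
    """Number of dyadic boxes of width 2**-n hit by the level-`level` intervals."""
    lefts, length = cantor_intervals(level)
    scale = 2 ** n
    boxes = set()
    for left in lefts:
        first = math.floor(left * scale)
        last = min(math.floor((left + length) * scale), scale - 1)
        boxes.update(range(first, last + 1))
    return len(boxes)

n = 12
# Level 8 suffices because 3**-8 is smaller than the box width 2**-12.
estimate = math.log2(box_count(n, level=8)) / n
print(estimate, math.log(2) / math.log(3))
```

The estimate converges slowly because the box count carries a constant factor whose logarithm is divided by $n$; larger $n$ (with a correspondingly deeper construction level) reduces this offset.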
We are now ready to state

Proposition 17. Let $X$ be an RV with a probability measure
$P_X$ with positive information dimension $d(X)$, supported on
a compact set $\mathcal{X} \subset \mathbb{R}^N$ with positive Minkowski dimension
$d_B(\mathcal{X})$. Then, the error probability bounds the relative
information loss from above, i.e.,

$$l(X \to Y) \le P_e \frac{d_B(\mathcal{X})}{d(X)}. \tag{88}$$

Proof: See Appendix J. ■

In the case where $\mathcal{X} \subset \mathbb{R}^N$ and $P_X \ll \mu^N$ it can be shown (see footnote 7)
that this result simplifies to $l(X \to Y) \le P_e$. Comparing this
to the results of Example 1, we can see that for a quantizer
the reconstruction error probability is always $P_e = 1$.

The possible region for a $(P_e, l(X \to Y))$-pair is depicted
in Fig. 8. Note that, to our knowledge, this region cannot be
restricted further. For example, with reference to Section III,
there exist systems with $l(X \to Y) = 0$ but with $P_e > 0$.
Conversely, for a simple projection one will have $P_e = 1$
while $l(X \to Y) < 1$. Finally, that $l(X \to Y) = 1$ need not
imply $P_e = 1$ can be shown by revisiting the center clipper:

Example 2 (Center Clipper (revisited)). Assume that the input
probability measure $P_X$ is mixed with an absolutely continuous
component supported on $[-c, c]$ ($0 < P_X([-c,c]) < 1$)
and a point mass at an arbitrary point $x_0 \notin [-c, c]$. According
to [26], [51], we have $d(X) = P_X([-c,c])$. The output
probability measure $P_Y$ has two point masses at $0$ and $x_0$ with
$P_Y(0) = P_X([-c,c])$ and $P_Y(x_0) = 1 - P_X([-c,c])$, respectively.
Clearly, $d(X \mid Y = 0) = 1$ while $d(X \mid Y = x_0) = 0$.
Consequently,

$$l(X \to Y) = \frac{d(X \mid Y)}{d(X)} = 1. \tag{89}$$

In comparison to that, we have $P_e \le P_X([-c,c])$, since one
can always use the reconstructor $r(y) = x_0$ for all $y$.

7 Always, $d(X) \le d_B(\mathcal{X}) \le N$ if $\mathcal{X} \subset \mathbb{R}^N$, e.g., by [50, Thm. 1 and
Lem. 4]. $d(X) = N$ leads to the desired result.

This is a further example where Proposition 4 holds, despite
that neither the center clipper is Lipschitz, nor that the require- 
ment of a continuously distributed input RV in Proposition [Pfl 
is met. 

It is worth mentioning that Proposition 17 allows us to
prove a converse to lossless analog compression, as it was
investigated in [26], [51]. To this end, and borrowing the
terminology and notation from [26], we encode a length-$n$
block $X^n$ of independent realizations of a real-valued input RV
$X$ with information dimension $0 < d(X) \le 1$ via a Lipschitz
mapping to the Euclidean space of dimension $\lfloor Rn \rfloor \le n$. Let
$R(\epsilon)$ be the infimum of $R$ such that there exists a Lipschitz
$g\colon \mathbb{R}^n \to \mathbb{R}^{\lfloor Rn \rfloor}$ and an arbitrary (measurable) reconstructor
$r$ such that $P_e \le \epsilon$.

Corollary 6 (Converse for Lipschitz Encoders; connection
to [26], [51, eq. (26)]). For a memoryless source with compactly
supported marginal distribution $P_X$ and information
dimension $0 < d(X) \le 1$, and a Lipschitz encoder function $g$,

$$R(\epsilon) \ge d(X) - \epsilon. \tag{90}$$

Proof: Since $X^n$ is the collection of $n$ real-valued, independent
RVs it follows that $\mathcal{X}^n = \mathbb{R}^n$ and thus $d_B(\mathcal{X}^n) = n$.
With Proposition 17 we thus obtain

\begin{align}
n P_e &\ge d(X^n)\, l(X^n \to Y^n) \tag{91} \\
&\overset{(a)}{=} d(X^n) - d(Y^n) \tag{92} \\
&\overset{(b)}{=} n\, d(X) - d(Y^n) \tag{93}
\end{align}

where (a) is due to Proposition 4 and (b) is due to the fact that
the information dimension of a set of independent RVs is the
sum of the individual information dimensions (see, e.g., [34]
or E3 Lem. 3]). Since $Y^n$ is an $\mathbb{R}^{\lfloor Rn \rfloor}$-valued RV,
$d(Y^n) \le \lfloor Rn \rfloor$. Thus,

$$n P_e \ge n\, d(X) - \lfloor Rn \rfloor \ge n\, d(X) - Rn. \tag{94}$$

Dividing by the block length $n$ and rearranging the terms
yields

$$R \ge d(X) - P_e. \tag{95}$$

This completes the proof. ■
While this result - compared with those presented in [26],
[51] - is rather weak, it suggests that our theory has
relationships with different topics in information theory, such as
compressed sensing. Note further that we need not restrict
the reconstructor, since we only consider the case where
already the encoder - the function $g$ - loses information. The
restriction of Lipschitz continuity cannot be dropped, however,
since only this class of functions guarantees that Proposition 4
holds. In general, as stated in [26], there are non-Lipschitz
bijections from $\mathbb{R}^n$ to $\mathbb{R}$.



D. Special Case: 1D-maps and mixed RVs

We briefly analyze the relative information loss for scenarios
similar to the one of Example 2. We consider the case where
$\mathcal{X}, \mathcal{Y} \subset \mathbb{R}$, but we drop the restriction that $P_X \ll \mu$. Instead,
we limit ourselves to mixtures of continuous and discrete
probability measures, i.e., we assume that $P_X$ has no singular
continuous component. Thus, [33, pp. 121]

$$P_X = P_X^{ac} + P_X^{d}. \tag{96}$$

According to (30), we obtain $d(X) = P_X^{ac}(\mathcal{X})$. We
present


Proposition 18 (Relative Information Loss for Mixed RVs).
Let $X$ be a mixed RV with a probability measure $P_X = P_X^{ac} +
P_X^{d}$, $0 < P_X^{ac}(\mathcal{X}) \le 1$. Let $\{\mathcal{X}_i\}$ be a finite partition of $\mathcal{X} \subset \mathbb{R}$
into compact sets. Let $g$ be a bounded function such that $g|_{\mathcal{X}_i}$
is either injective or constant. The relative information loss is
given as

$$l(X \to Y) = \frac{P_X^{ac}(A)}{P_X^{ac}(\mathcal{X})} \tag{97}$$

where $A$ is the union of sets $\mathcal{X}_i$ on which $g$ is constant.

Proof: See Appendix K. ■
If we compare this result with Corollary 5, we see that the
former implies the latter for $P_X^{ac}(\mathcal{X}) = 1$. Moreover, as we
saw in Example 2, the relative information loss induced by
a function $g$ can increase if the probability measure is not
absolutely continuous: In this case, from $P_X([-c,c])$ (where
$P_X \ll \mu$) to $1$. As we will show next, the relative information
loss can also decrease:

Example 2 (Center Clipper (revisited)). Assume that
$P_X^{ac}(\mathcal{X}) = 0.6$ and $P_X^{ac}([-c,c]) = 0.3$. The remaining probability
mass is a point mass at zero, i.e., $P_X(0) = P_X^{d}(0) = 0.4$.
It follows that $d(X) = 0.6$ and, from Proposition 18,
$l(X \to Y) = 0.5$. We choose a fixed reconstructor $r(0) = 0$ with
a reconstruction error probability $P_e = P_X^{ac}([-c,c]) = 0.3$.
Using Proposition 17 we obtain

$$0.5 = l(X \to Y) \le \frac{d_B(\mathcal{X})}{d(X)} P_e = \frac{1}{0.6} \cdot 0.3 = 0.5 \tag{98}$$

which shows that in this case the bound holds with equality.

Consider now the case that the point mass at $0$ is split
into two point masses at $a, b \in [-c, c]$, where $P_X(a) = 0.3$
and $P_X(b) = 0.1$. Using $r(0) = a$ the reconstruction error
increases to $P_e = P_X([-c,c]) - P_X(a) = 0.4$. The inequality of
Proposition 17 now is strict.
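The arithmetic of this example can be retraced in a few lines. The sketch assumes $d_B(\mathcal{X}) = 1$ for the one-dimensional support and uses the relations $d(X) = P_X^{ac}(\mathcal{X})$ and $l(X \to Y) = P_X^{ac}(A)/P_X^{ac}(\mathcal{X})$ stated above; the function names are hypothetical.

```python
def relative_loss_mixed(ac_mass_constant, ac_mass_total):
    """Relative information loss P_X^ac(A) / P_X^ac(X) of a mixed RV."""
    return ac_mass_constant / ac_mass_total

def error_bound(p_e, d_x, d_box=1.0):
    """Upper bound P_e * d_B(X) / d(X) on the relative information loss."""
    return p_e * d_box / d_x

ac_total, ac_clipped = 0.6, 0.3  # P_X^ac(X) and P_X^ac([-c, c])
loss = relative_loss_mixed(ac_clipped, ac_total)

# Case 1: single point mass 0.4 at zero, reconstructor r(0) = 0.
p_e_1 = ac_clipped
bound_1 = error_bound(p_e_1, d_x=ac_total)

# Case 2: point masses 0.3 at a and 0.1 at b inside [-c, c], reconstructor r(0) = a.
p_e_2 = (ac_clipped + 0.3 + 0.1) - 0.3
bound_2 = error_bound(p_e_2, d_x=ac_total)

print(loss, bound_1, bound_2)
```

In the first case the bound meets the relative information loss, while in the second case the larger error probability leaves a gap between the two.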



V. Implications for a System Theory 

In the previous sections we have developed a series of
results about the information loss - absolute or relative -
caused by deterministic systems. In fact, quite many of the
basic building blocks of static, i.e., memoryless, systems have
been dealt with: Quantizers, the bridge between continuous-
and discrete-amplitude systems, have been dealt with in
Example 1. Cascades of systems allow a simplified analysis
employing Propositions 1 and 5. Dimensionality reductions of
all kinds - functions which are constant somewhere, omitting
coordinates of RVs, etc. - were a major constituent part of
Section IV. Finally, the third-order polynomial (Example 6)
is significant in view of Weierstrass' approximation theorem
(polynomial functions are dense in the space of continuous
functions supported on bounded intervals). Connecting
subsystems in parallel and adding the outputs - a case of
dimensionality reduction - is so common that it deserves
separate attention:

Example 8 (Adding Two RVs). We consider two $N$-dimensional input RVs $X_1$ and $X_2$, and assume that the output of the system under consideration is given as

$$Y = X_1 + X_2 \qquad (99)$$

i.e., as the sum of these two RVs.

We start by assuming that $X_1$ and $X_2$ have a joint probability measure $P_{X_1,X_2} \ll \mu^{2N}$. As can be shown rather easily by transforming $X_1, X_2$ invertibly to $X_1 + X_2, X_1$ and dropping the second coordinate, it follows that in this case

$$l(X_1, X_2 \to Y) = \frac{1}{2}. \qquad (100)$$

Things may look totally different if the joint probability measure $P_{X_1,X_2}$ is supported on some lower-dimensional submanifold of $\mathbb{R}^{2N}$. Consider, e.g., the case where $X_2 = -X_1$, thus $Y = 0$, and $l(X_1, X_2 \to Y) = 1$. In contrast, assume that both input variables are one-dimensional, and that $X_2 = -0.01 X_1^3$. Then, as it turns out,

$$Y = X_1 - 0.01 X_1^3 = -0.01 (X_1^3 - 100 X_1) \qquad (101)$$

which is a piecewise bijective function. As the analysis of Example 6 shows, $l(X_1, X_2 \to Y) = 0$ in this case.
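As a small numerical illustration of the last case (not part of the original analysis), the following Python sketch counts the real preimages of the cubic in (101). The helper names `g` and `preimages` are hypothetical and chosen only for this example. At most three preimages exist for any output value, which is what makes the function piecewise bijective with a finite absolute information loss.

```python
import numpy as np

def g(x):
    """System of Example 8 with X2 = -0.01 * X1**3, i.e. Y = X1 + X2."""
    return x - 0.01 * x**3

def preimages(y):
    """Real solutions x of g(x) = y, computed as roots of a cubic."""
    # g(x) - y = -0.01 x^3 + 0 x^2 + x - y
    roots = np.roots([-0.01, 0.0, 1.0, -y])
    real = roots[np.abs(roots.imag) < 1e-6].real
    return np.sort(real)

# Critical points of g follow from g'(x) = 1 - 0.03 x^2 = 0.
x_crit = np.sqrt(1.0 / 0.03)
y_crit = g(x_crit)

print(preimages(0.0))                  # three preimages: -10, 0, 10
print(len(preimages(y_crit + 1.0)))    # beyond the local maximum: one preimage
```

Outputs between the local minimum and the local maximum of the cubic have three preimages; all other outputs have exactly one.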

We will next apply these results to systems which are larger than the toy examples presented so far. In particular, we focus on an autocorrelation receiver as an example for a system which loses an infinite amount of information, and on an accumulator which will be shown to lose only a finite amount. Yet another example - the analysis of principal component analysis employing the sample covariance matrix - can be found in the extended version of [47]. We will not only analyze the information loss in these systems, but also investigate how information propagates on the signal flow graph of the corresponding system. This will eventually mark a first step towards a system theory from an information-theoretic point-of-view.

A. Multi-Channel Autocorrelation Receiver 

The multi-channel autocorrelation receiver (MC-AcR) was introduced in [52] and analyzed in [53], [54] as a non-coherent receiver architecture for ultrawide band communications. In this receiver the decision metric is formed by evaluating the autocorrelation function of the input signal for multiple time lags (see Fig. 9).

To simplify the analysis, we assume that the input signal is a discrete-time, complex-valued $N$-periodic signal superimposed with independent and identically distributed complex-valued noise. The complete analysis can be based on $N$ consecutive values of the input, which we will denote with $X^{(0)}$ through $X^{(N-1)}$ ($X$ is the collection of these RVs). The real and imaginary parts of $X^{(n)}$ will be denoted as $\Re X^{(n)}$ and $\Im X^{(n)}$, respectively. We may assume that $P_X \ll \mu^{2N}$, where $P_X$ is compactly supported. We consider only three time lags $k_1, k_2, k_3 \in \{1, \dots, N-1\}$.

Fig. 10. Equivalent model for multiplying the two branches in Fig. 9.

By assuming periodicity of the input signal, we can replace the linear autocorrelation by the circular autocorrelation (e.g., [1, pp. 655])

$$R^{(k)} = \sum_{n=0}^{N-1} X^{(n)} X^{*(n+k)} = \sum_{n=0}^{N-1} Y_k^{(n)} \qquad (102)$$

where * denotes complex conjugation, and where $k$ assumes one value of the set $\{k_1, k_2, k_3\}$. Note further that the circular autocorrelation here is implicit, since $X^{(n)} = X^{(n+N)}$.

We start by noting that for $k = 0$, $l(X \to Y_0) = \frac{1}{2}$, since then

$$Y_0^{(n)} = X^{(n)} X^{*(n)} = |X^{(n)}|^2 \qquad (103)$$

which shows that $Y_0$ is real and thus $P_{Y_0} \ll \mu^{N}$. Since $R^{(0)}$ is the sum of the components of $Y_0$, it is again real and we get

$$t(X \to R^{(0)}) = \frac{1}{2N}. \qquad (104)$$
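For non-zero $k$ note that (see Fig. 10)

$$X^{(n)} X^{*(n+k)} = e^{\log\left(X^{(n)} X^{*(n+k)}\right)} = e^{\log X^{(n)} + \log X^{*(n+k)}} = e^{\log X^{(n)} + (\log X^{(n+k)})^*}$$
$$= e^{\tilde X^{(n)} + (\tilde X^{(n+k)})^*} = e^{\tilde Y_k^{(n)}} = Y_k^{(n)} \qquad (105)$$

where the tilde marks the logarithm of the respective quantity, i.e., $\tilde X^{(n)} = \log X^{(n)}$. In other words, we can write the multiplication as an addition (logarithm and exponential function are invertible and, thus, information lossless).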

Letting $\tilde X_k$ denote the vector of elements indexed by $\tilde X^{(n+k)}$, we get

$$\tilde X_k = C_k \tilde X \qquad (106)$$

where $C_k$ is a circulant permutation matrix. Thus, $\tilde Y_k = \tilde X + \tilde X_k^*$ and

$$\Re \tilde Y_k = (I + C_k)\, \Re \tilde X \qquad (107)$$
$$\Im \tilde Y_k = (I - C_k)\, \Im \tilde X \qquad (108)$$

where $I$ is the $N \times N$ identity matrix. Since $I + C_k$ is invertible, we have $P_{\Re \tilde Y_k} \ll \mu^{N}$. In contrast, the rank of $I - C_k$ is $N - 1$ and thus $d(\Im \tilde Y_k) = N - 1$. It follows that

$$t(X \to Y_k) = \frac{2N-1}{2N} \qquad (109)$$

and $d(Y_k) = 2N - 1$.

Fig. 9. Discrete-time model of the multi-channel autocorrelation receiver: The elementary mathematical operations depicted are the complex conjugation (*), the summation of vector elements ($\Sigma$), and the circular shift (blocks with $k_i$). The information flow is illustrated using red arrows, labeled according to the relative information transfer.
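The rank statements behind (107)-(109) can be checked numerically. The following is a minimal sketch, assuming an odd prime period $N = 7$ and a hypothetical helper `circulant_shift`; neither is taken from the original text. For even $N$, or for lags sharing a factor with $N$, the ranks can differ from the values used above.

```python
import numpy as np

def circulant_shift(N, k):
    """Permutation matrix C_k with (C_k x)[n] = x[(n + k) mod N]."""
    C = np.zeros((N, N))
    C[np.arange(N), (np.arange(N) + k) % N] = 1.0
    return C

N = 7  # assumed example period (odd prime)
ranks = {}
for k in (1, 2, 3):
    C = circulant_shift(N, k)
    I = np.eye(N)
    # rank of (I + C_k) governs the real part, rank of (I - C_k) the imaginary part
    ranks[k] = (np.linalg.matrix_rank(I + C), np.linalg.matrix_rank(I - C))

print(ranks)
```

For this choice of $N$, each lag gives rank $N$ for $I + C_k$ and rank $N - 1$ for $I - C_k$, so the two parts contribute $N + (N - 1) = 2N - 1$ dimensions, in line with (109).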

Clearly, for $k \neq 0$ the autocorrelation will be a complex number a.s., thus $d(R^{(k)}) = 2$. Since the summation is a Lipschitz function, we obtain

$$t(Y_k \to R^{(k)}) = \frac{2}{2N-1} \qquad (110)$$

and by the result about the cascades,

$$t(X \to R^{(k)}) = \frac{2N-1}{2N} \cdot \frac{2}{2N-1} = \frac{1}{N}. \qquad (111)$$

Finally, applying Proposition 15,

$$t(X \to R^{(k_1)}, R^{(k_2)}, R^{(k_3)}) \le \frac{3}{N}. \qquad (112)$$



Note that this analysis would imply that if all values of the autocorrelation function were evaluated, the relative information transfer would increase to

$$t(X \to R) \le \frac{2N-1}{2N}. \qquad (113)$$

However, knowing that the autocorrelation function of a complex, periodic sequence is Hermitian and periodic with the same period, it follows that $P_R \ll \mu^{N}$, and thus $t(X \to R) = \frac{1}{2}$. The bound is thus obviously not tight in this case. Note further that applying the same bound to the relative information transfer from $X$ to $Y_{k_1}, Y_{k_2}, Y_{k_3}$ would yield a number greater than one. This is simply due to the fact that the three output vectors have a lot of information in common, prohibiting simply adding their information dimensions.
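The Hermitian symmetry used in this argument, and the value of the sum bound, can be reproduced with a short Python sketch. The period $N = 8$, the random test signal, and the function name `circular_autocorrelation` are assumptions made for illustration only.

```python
import numpy as np
from fractions import Fraction

def circular_autocorrelation(x):
    """R[k] = sum_n x[n] * conj(x[(n + k) mod N]), following (102)."""
    N = len(x)
    return np.array([np.sum(x * np.conj(np.roll(x, -k))) for k in range(N)])

rng = np.random.default_rng(0)
N = 8  # assumed example period
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
R = circular_autocorrelation(x)

# Hermitian symmetry: R[-k] = conj(R[k]), indices taken modulo N
hermitian = np.allclose(R[(-np.arange(N)) % N], np.conj(R))

# The DFT of a Hermitian-symmetric sequence is real-valued
F_R = np.fft.fft(R)
dft_is_real = np.allclose(F_R.imag, 0.0, atol=1e-9)

# Sum bound over all lags: 1/(2N) for k = 0 plus 1/N for each k != 0
bound = Fraction(1, 2 * N) + (N - 1) * Fraction(1, N)
print(hermitian, dft_is_real, bound)
```

The symmetry leaves only $N$ free real coordinates in $R$, while the sum bound evaluates to $(2N-1)/(2N)$, which illustrates why that bound is loose here.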

A slightly different picture is revealed if we look at an equivalent signal model, where the circular autocorrelation is computed via the discrete Fourier transform (DFT, cf. Fig. 11): Letting $W$ denote the DFT matrix, we obtain the DFT of $X$ as $F_X = WX$. Doing a little algebra, the DFT of the autocorrelation function is obtained as

$$F_R = |F_X|^2. \qquad (114)$$

Fig. 11. Equivalent model of the MC-AcR of Fig. 9 using the DFT to compute the circular autocorrelation. IDFT denotes the inverse DFT and $\Pi$ a projection onto a subset of coordinates. The information flow is indicated by red arrows labeled according to the relative information transfer.

Since the DFT is an invertible transform ($W$ is a unitary matrix), one has $d(F_X) = d(X) = 2N$. The squaring of the magnitude can be written as a cascade of a coordinate transform (from Cartesian to polar coordinates; invertible), a dimensionality reduction (the phase information is dropped), and a squaring function (invertible, since the magnitude is a non-negative quantity). It follows that

$$l(X \to R) = l(F_X \to F_R) = \frac{1}{2} \qquad (115)$$

since the inverse DFT is again invertible.

Since $F_R$ is a real RV and thus $P_{F_R} \ll \mu^{N}$, it clearly follows that also $P_R \ll \mu^{N}$, despite the fact that $R$ is a vector of $N$ complex numbers. The probability measure of this RV is concentrated on an $N$-dimensional submanifold of $\mathbb{R}^{2N}$ defined by the periodicity and Hermitian symmetry of $R$. Choosing three coordinates of $R$ as the final output of the system amounts to upper bounding the information dimension of the output by

$$d(R^{(k_1)}, R^{(k_2)}, R^{(k_3)}) \le 6. \qquad (116)$$

Thus,

$$t(X \to R^{(k_1)}, R^{(k_2)}, R^{(k_3)}) \le \frac{3}{N} \qquad (117)$$

where equality is achieved if, e.g., all time lags are distinct and smaller than $\frac{N}{2}$. The information flow for this example - computed from the relative information loss and information transfer - is also depicted in Figs. 9 and 11.

B. Accumulator

As a further example, consider the system depicted in Fig. 12. We are interested in how much information we lose about a probability measure of an RV $X$ if we observe the probability measure of a sum of iid copies of this RV. For simplicity, we assume that the probability measure of $X$ is supported on a finite field $\mathcal{X} = \{0, \dots, N-1\}$, where we assume $N$ is even.

Fig. 12. Accumulating a sequence of independent, identically distributed RVs $X^{(i)}$.

The input to the system is thus the probability mass function (PMF) $p_X$. Since its elements must sum to unity, the vector $p_X$ can be chosen from the $(N-1)$-simplex in $\mathbb{R}^{N}$; we assume $P_{p_X} \ll \mu^{N-1}$. The output of the system at a certain discrete time index $i$ is the PMF $p_{S^{(i)}}$ of the sum of $i$ iid copies of $X$:

$$S^{(i)} = \bigoplus_{k=1}^{i} X^{(k)} \qquad (118)$$

where $\oplus$ denotes modulo-addition.

Given the PMF $p_{S^{(i)}}$ of the output at some time $i$ (e.g., by computing the histogram of multiple realizations of this system), how much information do we lose about the PMF of the input $X$? Mathematically, we are interested in the following quantity$^{8}$:

$$L(p_X \to p_{S^{(i)}}). \qquad (119)$$

From the theory of Markov chains we know that $p_{S^{(i)}}$ converges to a uniform distribution on $\mathcal{X}$. To see this, note that the transition from $S^{(i)}$ to $S^{(i+1)}$ can be modeled as a cyclic random walk; the transition matrix of the corresponding Markov chain is a positive circulant matrix built$^{9}$ from $p_X$. As a consequence of the Perron-Frobenius theorem (e.g., [4, Thm. 15-8]) there exists a unique stationary distribution which, for a doubly stochastic matrix as in this case, equals the uniform distribution on $\mathcal{X}$ (cf. [55, Thm. 4.1.7]).

To attack the problem, we note that the PMF of the sum of independent RVs is given as the convolution of the PMFs of the summands. In case of the modulo sum, the circular convolution needs to be applied instead, as can be shown by computing one Markov step. Using the DFT again, we can write the circular convolution as a multiplication; in particular (see, e.g., [1, Sec. 8.6.5])

$$F_{(p_X \ast p_{S^{(i)}})} = F_{p_X} \odot F_{p_{S^{(i)}}} \qquad (120)$$

where $\ast$ denotes circular convolution, $\odot$ the element-wise product, and $F_{p_X} = W p_X$. Iterating the system a few time steps and repeating this analysis yields

$$p_{S^{(i)}} = W^{-1} F_{p_X}^{i} \qquad (121)$$

where the $i$-th power is taken element-wise. Since neither DFT nor inverse DFT lose information, we only have to consider the information lost in taking $F_{p_X}$ to the $i$-th power.

We now employ a few properties of the DFT (see, e.g., [1, Sec. 8.6.4]): Since $p_X$ is a real vector, $F_{p_X}$ will be Hermitian (circularly) symmetric; moreover, if we use indices from $0$ to $N-1$, we have $F_{p_X}^{(0)} = 1$ and $F_{p_X}^{(N/2)} \in \mathbb{R}$. Taking the $i$-th power of a real number $\beta$ loses at most one bit of information for even $i$ and nothing for odd $i$; thus

$$L(\beta \to \beta^{i}) \le \cos^2\left(\frac{i\pi}{2}\right). \qquad (122)$$

Taking the power of a complex number $\alpha$ corresponds to taking the power of its magnitude (which can be inverted by taking the corresponding root) and multiplying its phase. Only the latter is a non-injective operation, since the $i$-th root of a complex number yields $i$ different solutions for the phase. Thus, invoking Proposition 9,

$$L(\alpha \to \alpha^{i}) \le \log i. \qquad (123)$$

Applying (122) to index $\frac{N}{2}$ and (123) to the indices $\{1, \dots, \frac{N}{2} - 1\}$ reveals that

$$L(p_X \to p_{S^{(i)}}) = L(F_{p_X} \to F_{p_X}^{i}) \le \left(\frac{N}{2} - 1\right) \log i + \cos^2\left(\frac{i\pi}{2}\right). \qquad (124)$$

Thus, the information loss increases linearly with the number of components, but sublinearly with time. Moreover, since for $i \to \infty$, $p_{S^{(i)}}$ converges to a uniform distribution, it is plausible that $L(p_X \to p_{S^{(i)}}) \to \infty$ (at least for $N > 2$). Intuitively, the earlier one observes the output of such a system, the more information about the unknown PMF $p_X$ can be retrieved.

8 Note that $L(p_X \to p_{S^{(i)}})$ is not to be confused with the conditional entropy of the initial state of a Markov chain given a later state, a quantity measuring the information loss about the initial state of the chain.

9 In this particular case it can be shown that the first row of the transition matrix consists of the elements of $p_X$, while all other rows are obtained by circularly shifting the first one.
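The DFT view of the accumulator can be tried out numerically. The sketch below is only an illustration under stated assumptions: the example PMF with $N = 4$ and both function names are hypothetical, and NumPy's unnormalized FFT is used so that the convolution theorem holds without a scale factor and the zeroth DFT coefficient equals one.

```python
import numpy as np

def pmf_of_modulo_sum(p_x, i):
    """PMF of the modulo-N sum of i iid copies of X via element-wise DFT powers."""
    F = np.fft.fft(p_x)               # unnormalized DFT, so F[0] = sum(p_x) = 1
    return np.real(np.fft.ifft(F ** i))

def pmf_by_convolution(p_x, i):
    """Same PMF by repeated circular convolution, used as a cross-check."""
    N = len(p_x)
    p = p_x.copy()
    for _ in range(i - 1):
        p = np.array([sum(p[m] * p_x[(s - m) % N] for m in range(N))
                      for s in range(N)])
    return p

p_x = np.array([0.5, 0.2, 0.2, 0.1])  # assumed example PMF, N = 4
for i in (1, 2, 5, 50):
    print(i, np.round(pmf_of_modulo_sum(p_x, i), 4))
```

For growing $i$ all DFT coefficients other than the zeroth decay towards zero, so the printed PMFs approach the uniform distribution, in line with the Markov chain argument.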






TABLE I

Comparison of different information flow measures. The input is assumed to have positive information dimension. The letter c indicates that the measure evaluates to a finite constant.

System           I(X;Y)   L(X -> Y)   t(X -> Y)   l(X -> Y)
Quantizer        c        ∞           0           1
Rectifier        ∞        c           1           0
Center Clipper   ∞        ∞           c           1 - c
Example 4        ∞        ∞           1           0
MC-AcR           ∞        ∞           c           1 - c
Accumulator      ∞        c           1           0


C. Metrics of Information Flow - Do We Need Them All? 

So far, for our system theory we have introduced absolute and relative information loss, as well as relative information transfer. Together with mutual information this makes four different metrics which can be used to characterize a deterministic input-output system. While clearly relative information loss and relative information transfer are equivalent, the previous examples showed that one cannot simply omit the other measures of information flow without losing flexibility in describing systems. Assuming that only finite measures are meaningful, a quantizer, e.g., requires mutual information, whereas a rectifier would need information loss. Center clippers or other systems for which both absolute measures are infinite benefit from relative measures only (be it either relative information transfer or relative information loss). We summarize notable examples from this work together with their adequate information measures in Table I.



VI. Open Issues and Outlook 

While this work may mark a step towards a system theory 
from an information-theoretic point-of-view, it is but a small 
step: In energetic terms, it would "just" tell us the difference 
between the signal variances at the input and the output of 
the system - a very simple, memoryless system. We do not 
yet know anything about the information-theoretic analog of 
power spectral densities (e.g., entropy rates), about systems 
with memory, or about the analog of more specific energy 
measures like the mean-squared reconstruction error assuming 
a signal model with noise (information loss "relevant" in view 
of a signal model). Moreover, we assumed that the input signal 
is sufficiently well-behaved in some probabilistic sense. Future 
work should mainly deal with extending the scope of our 
system theory. 

The first issue which will be addressed is the fact that at present we are just measuring information "as is"; every bit of input information is weighted equally, and losing a sign bit amounts to the same information loss as losing the least significant bit in a binary expansion of a (discrete) RV. This fact leads to the apparent counter-intuitivity of some of our results: To give an example from [47], the principal component analysis (PCA) applied prior to dimensionality reduction does not decrease the relative information loss$^{10}$; this loss is always fully determined by the information dimension of the input and the information dimension of the output (cf. Corollary 2). Contrary to that, the literature employs information theory to prove the optimality of PCA in certain cases [10], [57], [58] - but see also [59] for a recent work presenting conditions for the PCA depending on the spectrum of eigenvalues for a certain signal-noise model. To build a bridge between our theory of information loss and the results in the literature, the notion of relevance has to be brought into play, allowing us to place unequal weights on different portions of the information available at the input. In energetic terms: Instead of just comparing variances - which is necessary sometimes! - we are now interested, e.g., in a mean-squared reconstruction error w.r.t. some relevant portion of the input signal. We actually proposed the corresponding notion of relevant information loss in [60], where we showed its applicability in signal processing and machine learning and, among other things, re-established the optimality of PCA given a specific signal model. We furthermore showed that this notion of relevant information loss is fully compatible with what we present in this paper.

Going from variances to power spectral densities, or, from information loss to information loss rates, will represent the next step: If the input to our memoryless system is not a sequence of independent RVs but a discrete-time stationary stochastic process, how much information do we lose per unit time? Following [8], the information loss rate should be upper bounded by the information loss (assuming the marginal distribution of the process as the distribution of the input RV). Aside from this, little is known about this scenario, and we hope to shed some light on this issue in the future. Of particular interest would be the reconstruction of nonlinearly distorted sequences, extending Sections III-C and IV-C.

The next, bigger step is from memoryless to dynamical input-output systems: A particularly simple subclass of these are linear filters, which were already analyzed by Shannon$^{11}$. In the discrete-time version taken from [4, pp. 663] one gets the differential entropy rate of the output process $Y$ by adding a system-dependent term to the differential entropy rate of the input process $X$, or

$$\bar h(Y) = \bar h(X) + \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \left| H(e^{j\theta}) \right| d\theta \qquad (125)$$

where $H(e^{j\theta})$ is the frequency response of the filter. The fact that the latter term is independent of the process statistics shows that it is only related to the change of variables, and not to information loss. In that sense, linear filters do not perform information processing, where the change of information measures should obviously depend on the input signal statistics.
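To make the system-dependent term in (125) concrete, the following Python sketch approximates it on a uniform frequency grid for two first-order FIR filters. The filters, the grid size, and the function name `entropy_rate_shift_bits` are assumptions for illustration; the logarithm is taken to base 2, so the result is in bits.

```python
import numpy as np

def entropy_rate_shift_bits(h, num_points=4096):
    """Approximate (1 / (2*pi)) * integral of log2|H(e^{j theta})| over one period."""
    theta = 2.0 * np.pi * np.arange(num_points) / num_points
    n = np.arange(len(h))
    # Frequency response of the FIR filter with impulse response h
    H = np.exp(-1j * np.outer(theta, n)) @ np.asarray(h, dtype=complex)
    return np.mean(np.log2(np.abs(H)))

print(entropy_rate_shift_bits([1.0, -0.5]))  # zero inside the unit circle: approx. 0 bits
print(entropy_rate_shift_bits([1.0, -2.0]))  # zero outside the unit circle: approx. 1 bit
```

Neither value depends on the statistics of the input process, which is the point made above: the term reflects only the change of variables induced by the filter.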

10 This opposes the intuition that preserving the subspace with the largest variance should also preserve most of the information: For example, the Wikipedia article [56] states, among other things, that "[...] PCA can supply the user with a lower-dimensional picture, a 'shadow' of this object when viewed from its (in some sense) most informative viewpoint" and that the variances of the dropped coordinates "tend to be small and may be dropped with minimal loss of information".

11 To be specific, in his work [61] he analyzed the entropy loss in linear filters.

Finally, the class of nonlinear dynamical systems is significantly more difficult. We were able to present some results for discrete alphabets in [9]. For more general process alphabets we can only hope to obtain results for special subclasses, e.g., Volterra systems or affine input systems. For example, Wiener and Hammerstein systems, which are cascades of linear filters and static nonlinear functions, can completely be dealt with by generalizing our present work to stochastic processes.

Finally, many other aspects are worth investigating: The connection between information loss and entropy production in iterated function systems [38] and to heat dissipation (Landauer's principle [62]) could be of interest. Yet another interesting point is the connection between energetic and information-theoretic measures, as it exists for the Gaussian distribution.



VII. Conclusion

We presented an information-theoretic way to characterize the behavior of deterministic, memoryless input-output systems. In particular, we defined an absolute and a relative measure for the information loss occurring in the system due to its potential non-injectivity. Since the absolute loss can be finite for a subclass of systems despite a continuous-valued input, we were able to derive Fano-type inequalities between the information loss and the probability of a reconstruction error.

The relative measure of information loss, introduced in this work to capture systems in which an infinite amount of information is lost (and, possibly, preserved), was shown to be related to Renyi's information dimension and to present a lower bound on the reconstruction error probability. With the help of an example we showed that this bound can be tight and that, even in cases where an infinite amount of information is lost, the probability of a reconstruction error need not be unity.

While our theoretical results were developed mainly in view of a system theory, we believe that some of them may be of relevance also for analog compression, reconstruction of nonlinearly distorted signals, chaotic iterated function systems, and the theory of Perron-Frobenius operators.

Appendix A
Proof of Proposition 4

For the proof we need the following Lemma:

Lemma 1. Let $X$ and $Y$ be the input and output of a Lipschitz function $g: \mathcal{X} \to \mathcal{Y}$, $\mathcal{X} \subseteq \mathbb{R}^N$, $\mathcal{Y} \subseteq \mathbb{R}^M$. Then,

$$\lim_{n\to\infty} \frac{H(\hat Y_n | \hat X_n)}{n} = 0. \qquad (126)$$

Proof: We provide the proof by showing that $H(\hat Y_n | \hat X_n = \hat x_k)$ is finite for all $n$ and for all $\hat x_k$, with a bound that depends on neither $n$ nor $\hat x_k$. From this it then immediately follows that

$$\lim_{n\to\infty} \frac{H(\hat Y_n | \hat X_n)}{n} = \lim_{n\to\infty} \frac{1}{n} \sum_k H(\hat Y_n | \hat X_n = \hat x_k)\, P_X(\hat{\mathcal{X}}_k^{(n)}) = 0. \qquad (127)$$

To this end, note that the conditional probability measure $P_{X|\hat X_n = \hat x_k}$ is supported on $\hat{\mathcal{X}}_k^{(n)}$, and that, thus, $P_{Y|\hat X_n = \hat x_k}$ is supported on $g(\hat{\mathcal{X}}_k^{(n)})$. Since $g$ is Lipschitz, there exists a constant $\lambda$ such that for all $a, b \in \hat{\mathcal{X}}_k^{(n)}$

$$|g(a) - g(b)| \le \lambda |a - b|. \qquad (128)$$

Choose $a$ and $b$ such that the term on the left is maximized, i.e.,

$$\sup_{a,b} |g(a) - g(b)| = |g(a^\circ) - g(b^\circ)| \le \lambda |a^\circ - b^\circ| \le \lambda \sup_{a,b} |a - b| \qquad (129)$$

or, in other words,

$$\mathrm{diam}(g(\hat{\mathcal{X}}_k^{(n)})) \le \lambda\, \mathrm{diam}(\hat{\mathcal{X}}_k^{(n)}) \le \frac{\lambda \sqrt{N}}{2^n}. \qquad (130)$$

The latter inequality follows since $\hat{\mathcal{X}}_k^{(n)}$ is inside an $N$-dimensional hypercube of side length $\frac{1}{2^n}$ (one may have equality in the last statement if the hull of $\hat{\mathcal{X}}_k^{(n)}$ and the boundary of $\mathcal{X}$ are disjoint).

Now note that the support of $P_{Y|\hat X_n = \hat x_k}$ can be covered by an $M$-dimensional hypercube of side length $\mathrm{diam}(g(\hat{\mathcal{X}}_k^{(n)}))$, which can again be covered by

$$\left\lceil 2^n\, \mathrm{diam}(g(\hat{\mathcal{X}}_k^{(n)})) + 1 \right\rceil^M \qquad (131)$$

$M$-dimensional hypercubes of side length $\frac{1}{2^n}$. By the maximum entropy property of the uniform distribution we get

$$H(\hat Y_n | \hat X_n = \hat x_k) \le \log \left\lceil 2^n\, \mathrm{diam}(g(\hat{\mathcal{X}}_k^{(n)})) + 1 \right\rceil^M \le M \log \left\lceil 2^n \frac{\lambda \sqrt{N}}{2^n} + 1 \right\rceil = M \log \left\lceil \lambda \sqrt{N} + 1 \right\rceil < \infty. \qquad (132)$$

This completes the proof. ■
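The covering argument behind (131) and (132) can be illustrated numerically for a scalar example. The sketch below assumes $N = M = 1$, the Lipschitz function $g(x) = \sin(3x)$ on $[0, 1)$ with $\lambda = 3$, and a hypothetical helper `max_output_cells`; because it samples finitely many points per cell, it can only undercount the output cells that are hit.

```python
import numpy as np

def max_output_cells(g, n, samples_per_cell=64):
    """Largest number of 2^-n output cells hit by any single 2^-n input cell on [0, 1)."""
    cells = 2 ** n
    worst = 0
    for k in range(cells):
        # sample points inside the k-th input cell [k / 2^n, (k + 1) / 2^n)
        x = (k + (np.arange(samples_per_cell) + 0.5) / samples_per_cell) / cells
        y_cells = np.floor(cells * g(x))
        worst = max(worst, len(np.unique(y_cells)))
    return worst

lipschitz_const = 3.0                        # |d/dx sin(3x)| <= 3
g = lambda x: np.sin(3.0 * x)
bound = int(np.ceil(lipschitz_const + 1))    # (131) with N = M = 1
counts = [max_output_cells(g, n) for n in range(1, 11)]
print(counts, bound)
```

The counts stay below the bound for every resolution $n$, so the conditional entropy per input cell is at most $\log_2 4 = 2$ bits here and vanishes after division by $n$, as the lemma states.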
We now turn to the

Proof of Proposition 4: By noticing that $\hat Y_n$ is a function of $Y$ and using the chain rule of mutual information we expand the term in Definition 4 as

$$t(X \to Y) = \lim_{n\to\infty} \frac{I(\hat X_n; \hat Y_n) + I(\hat X_n; Y | \hat Y_n)}{H(\hat X_n)} \qquad (133)$$
$$= \lim_{n\to\infty} \frac{I(\hat X_n; \hat Y_n)/n + I(\hat X_n; Y | \hat Y_n)/n}{H(\hat X_n)/n} \qquad (134)$$
$$= \lim_{n\to\infty} \frac{H(\hat Y_n)/n + I(\hat X_n; Y | \hat Y_n)/n}{H(\hat X_n)/n} \qquad (135)$$

due to Lemma 1.

We now note that, by Kolmogorov's formula,

$$I(\hat X_n; Y | \hat Y_n) = I(\hat X_n, \hat Y_n; Y) - I(\hat Y_n; Y) \qquad (136)$$

$$I(X; Y | Y) = I(X, Y; Y) - I(Y; Y) \overset{(a)}{=} 0 \qquad (137)$$

where (a) is due to the fact that $X - Y - Y$ is a Markov chain (see, e.g., [28, pp. 43]). With [27, Lem. 7.22] and the way $\hat X_n$ is constructed from $X$ (cf. [26, Sec. III.D]) we find that $\lim_{n\to\infty} I(\hat X_n, \hat Y_n; Y) = I(X, Y; Y)$ and $\lim_{n\to\infty} I(\hat Y_n; Y) = I(Y; Y)$, and that, thus$^{12}$,

$$\lim_{n\to\infty} I(\hat X_n; Y | \hat Y_n) = 0. \qquad (138)$$

By employing Definition 3 the proof is completed. ■

Appendix B 
Proof of Proposition!?] 
Recall that Definition Q] states 

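$$L(X \to Y) = \lim_{n\to\infty} \left( I(\hat X_n; X) - I(\hat X_n; Y) \right). \qquad (139)$$

Let $\hat X_n = \hat x_k$ if $x \in \hat{\mathcal{X}}_k^{(n)}$. The conditional probability measure $P_{X|\hat X_n = \hat x_k} \ll \mu^N$ and thus possesses a density

$$f_{X|\hat X_n}(x, \hat x_k) = \begin{cases} \dfrac{f_X(x)}{p(\hat x_k)}, & \text{if } x \in \hat{\mathcal{X}}_k^{(n)} \\ 0, & \text{else} \end{cases} \qquad (140)$$

where $p(\hat x_k) = P_X(\hat{\mathcal{X}}_k^{(n)}) > 0$. By the same arguments as in the beginning of Section III, also $P_{Y|\hat X_n = \hat x_k} \ll \mu^N$, and its PDF is given by the method of transformation.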
With 02 Ch. 8.5], 

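$$L(X \to Y) = \lim_{n\to\infty} \left( h(X) - h(X|\hat X_n) - h(Y) + h(Y|\hat X_n) \right) \qquad (141)$$
$$= h(X) - h(Y) + \lim_{n\to\infty} \left( h(Y|\hat X_n) - h(X|\hat X_n) \right). \qquad (142)$$

The latter difference can be written as

$$h(Y|\hat X_n) - h(X|\hat X_n) = \sum_k p(\hat x_k) \left( h(Y|\hat X_n = \hat x_k) - h(X|\hat X_n = \hat x_k) \right).$$

Since (see [4, Thm. 5-1])

$$h(Y|\hat X_n = \hat x_k) = -\int_{\mathcal{Y}} f_{Y|\hat X_n}(y, \hat x_k) \log f_{Y|\hat X_n}(y, \hat x_k)\, dy$$
$$= -\int_{\mathcal{X}} f_{X|\hat X_n}(x, \hat x_k) \log f_{Y|\hat X_n}(g(x), \hat x_k)\, dx$$
$$= -\frac{1}{p(\hat x_k)} \int_{\hat{\mathcal{X}}_k^{(n)}} f_X(x) \log f_{Y|\hat X_n}(g(x), \hat x_k)\, dx \qquad (143)$$

we obtain

$$h(Y|\hat X_n) - h(X|\hat X_n) = \sum_k \int_{\hat{\mathcal{X}}_k^{(n)}} f_X(x) \log \frac{f_{X|\hat X_n}(x, \hat x_k)}{f_{Y|\hat X_n}(g(x), \hat x_k)}\, dx. \qquad (144)$$

Finally, using the method of transformation,

$$f_{Y|\hat X_n}(g(x), \hat x_k) = \sum_{x_i \in g^{-1}[g(x)]} \frac{f_{X|\hat X_n}(x_i, \hat x_k)}{|\det J_g(x_i)|}. \qquad (145)$$

Since the preimage of $g(x)$ is a set separated by neighborhoods$^{13}$, we can find an $n_0$ such that

$$\forall n \ge n_0: \hat k = \{k : x \in \hat{\mathcal{X}}_k^{(n)}\}: \quad g^{-1}[g(x)] \cap \hat{\mathcal{X}}_{\hat k}^{(n)} = x \qquad (146)$$

12 The authors thank Siu-Wai Ho for suggesting this proof method.

13 The space $\mathbb{R}^N$ is Hausdorff, so any two distinct points are separated by neighborhoods.

i.e., such that from this index on, the element of the partition under consideration, $\hat{\mathcal{X}}_{\hat k}^{(n)}$, contains just a single element of the preimage, $x$. Since $f_{X|\hat X_n}$ is non-zero only for arguments in $\hat{\mathcal{X}}_k^{(n)}$, in this case (145) degenerates to

$$f_{Y|\hat X_n}(g(x), \hat x_k) = \frac{f_{X|\hat X_n}(x, \hat x_k)}{|\det J_g(x)|}. \qquad (147)$$

Consequently, the ratio

$$\frac{f_{X|\hat X_n}(x, \hat x_k)}{f_{Y|\hat X_n}(g(x), \hat x_k)} \nearrow |\det J_g(x)| \qquad (148)$$

monotonically (the number of positive terms in the sum in the denominator reduces with $n$). We can thus apply the monotone convergence theorem, e.g., [33, pp. 21] and obtain

$$\lim_{n\to\infty} h(Y|\hat X_n) - h(X|\hat X_n) = \int_{\mathcal{X}} f_X(x) \log |\det J_g(x)|\, dx = E\{\log |\det J_g(X)|\}. \qquad (149)$$

This completes the proof. ■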

Appendix C 
Proof of Proposition^ 

We start by writing 

H(W\Y) = f H(W\Y = y)dP Y {y) 
Jy 

^p(i\y)\ogp{i\y)f Y {y)dy (150) 



where 



p(i\y) = Pv(W = i\Y = y) = P x \ Y=u {Xi). 



(151) 



We now, for the sake of simplicity, permit the Dirac delta 
distribution S as a PDF for discrete (atomic) probability 
measures. In particular and following [63], we write for the 
conditional PDF of Y given X = x, 

f Yl x(x,y) = 6(y- 9 (x))= £ ■ (152) 



- |det^(s,)| 



Applying Bayes' theorem for densities we get 



K%) = / fx\v{x,y)dx 

fY\x(x,y)fx(x) 



A', 
1 



Mv) 



dx 



(153) 
(154) 



= I \d C tJ g (gr^y))\f Y (y)^ 

o, 



if y G y { 

if y $ y< 



(156) 



by the properties of the delta distribution (e.g., Il64l ) and since, 
by Definition |5] at most one element of the preimage of y lies 
in Xi. 
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We rewrite (1150) as 

H(W\Y)= -J2f P(i\y)logp(i\y)f Y (y)dy (157) 

i ^ 

after exchanging the order of summation and integration with 
the help of Tonelli's theorem [65 Thm. 2.37] and by noticing 
that p(i\y) — if y ^ 3^. Inserting the expression for p(i\y) 
and changing the integration variables by substituting x = 
g~ 1 (y) in each integral yields 

H(W\Y) 



yi \det J g (g-\y))\ \detJ g (gr\y))\f Y (y) 



/x(x) 



fix 



f X X l0g \A v<T ( \\f I ( \\ dX 

\detJ g {x)\f Y {g{x)) 



(a) 



h(JT) - + E {log \det J g (X)\} 
L(X -> Y) 



(158) 
(159) 

(160) 

(161) 
(162) 



where (a) is due to splitting the logarithm and applying [4, 
Thm. 5-1]. ' ■ 

Appendix D
Proof of Proposition 9

The proof depends in parts on the proof of Proposition 8 where we showed that

$$L(X \to Y) = \int_{\mathcal{Y}} H(W | Y = y) f_Y(y)\, dy. \qquad (163)$$

The first inequality (49) is due to the maximum entropy property of the uniform distribution, i.e., $H(W|Y = y) \le \log \mathrm{card}(g^{-1}[y])$ with equality if and only if $p(i|y) = 1/\mathrm{card}(g^{-1}[y])$ for all $i$ for which $g_i^{-1}(y) \neq \emptyset$. But this translates to

$$\mathrm{card}(g^{-1}[y]) = \frac{|\det J_g(g_i^{-1}(y))|\, f_Y(y)}{f_X(g_i^{-1}(y))}. \qquad (164)$$

Inserting the expression for $f_Y$ and substituting $x$ for $g_i^{-1}(y)$ (it is immaterial which $i$ is chosen, as long as the preimage of $y$ is not the empty set) we obtain

$$\mathrm{card}(g^{-1}[g(x)]) = \sum_{x_k \in g^{-1}[g(x)]} \frac{f_X(x_k)}{|\det J_g(x_k)|} \frac{|\det J_g(x)|}{f_X(x)}. \qquad (165)$$

The second inequality (50) is due to Jensen [34, 2.6.2], where we wrote

$$E\{\log \mathrm{card}(g^{-1}[Y])\} \le \log E\{\mathrm{card}(g^{-1}[Y])\} = \log \int_{\mathcal{Y}} \mathrm{card}(g^{-1}[y])\, dP_Y(y)$$
$$= \log \int_{\mathcal{Y}} \sum_i \mathrm{card}(g_i^{-1}(y))\, dP_Y(y) = \log \sum_i \int_{\mathcal{Y}_i} dP_Y(y) \qquad (166)$$

since $\mathrm{card}(g_i^{-1}(y)) = 1$ if $y \in \mathcal{Y}_i$ and zero otherwise. Equality is achieved if and only if $\mathrm{card}(g^{-1}[y])$ is constant $P_Y$-a.s. In this case also the third inequality (51) is tight, which is obtained by replacing the expected value of the cardinality of the preimage by its essential supremum.

Finally, the cardinality of the preimage cannot be larger than the cardinality of the partition used in Definition 5, which yields the last inequality (52). For equality, consider that, assuming that all previous requirements for equality in the other bounds are fulfilled, (50) yields $\mathrm{card}(\{\mathcal{X}_i\})$ if and only if $P_Y(\mathcal{Y}_i) = 1$ for all $i$. This completes the proof. ■

Appendix E 
Proof of PropositionQJ] 

The proof follows closely the proof of Fano's inequality [34, pp. 38], where one starts with noticing that

$$H(X|Y) = H(E|Y) + H(X|E, Y). \qquad (167)$$

The first term, $H(E|Y)$, can of course be upper bounded by $H(E) = H_2(P_e)$, as in Fano's inequality. However, also

$$H(E|Y) = \int_{\mathcal{Y}} H_2(P_e(y))\, dP_Y(y) = \int_{\mathcal{Y} \setminus \mathcal{Y}_b} H_2(P_e(y))\, dP_Y(y) \le \int_{\mathcal{Y} \setminus \mathcal{Y}_b} dP_Y(y) = 1 - P_b \qquad (168)$$

since $H_2(P_e(y)) = P_e(y) = 0$ if $y \in \mathcal{Y}_b$ and since $H_2(P_e(y)) \le 1$ otherwise. Thus,

$$H(E|Y) \le \min\{H_2(P_e), 1 - P_b\}. \qquad (169)$$

For the second part note that $H(X|E = 0, Y = y) = 0$, so we obtain

$$H(X|E, Y) = \int_{\mathcal{Y}} H(X|E = 1, Y = y)\, P_e(y)\, dP_Y(y). \qquad (170)$$

Upper bounding the entropy by $\log(\mathrm{card}(g^{-1}[y]) - 1)$ we get

$$H(X|E, Y) \le P_e \int_{\mathcal{Y}} \log\left(\mathrm{card}(g^{-1}[y]) - 1\right) \frac{P_e(y)}{P_e}\, dP_Y(y) \qquad (171)$$
$$\overset{(a)}{\le} P_e \log\left( \int_{\mathcal{Y}} \left(\mathrm{card}(g^{-1}[y]) - 1\right) \frac{P_e(y)}{P_e}\, dP_Y(y) \right) \qquad (172)$$
$$\overset{(b)}{\le} P_e \log\left( \int_{\mathcal{Y}} \left(\mathrm{card}(g^{-1}[y]) - 1\right) dP_Y(y) \right) + P_e \log \frac{1}{P_e} \qquad (173)$$

where (a) is Jensen's inequality ($P_e(y)/P_e$ acts as a PDF) and (b) holds since $P_e(y) \le 1$ and due to splitting the logarithm. This completes the proof. ■

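Appendix F
Proof of Proposition 13

By construction, $r_{\mathrm{sub}}(y) = x$ whenever $x \in \mathcal{X}_k \cup \mathcal{X}_b$, and conversely, $r_{\mathrm{sub}}(y) \neq x$ whenever $x \notin \mathcal{X}_k \cup \mathcal{X}_b$. This yields $P_e = 1 - P_X(\mathcal{X}_k \cup \mathcal{X}_b)$.

For the Fano-type bound, we again notice that

$$H(X|Y) = H(E|Y) + H(X|E, Y). \qquad (174)$$

The first term can be written as

$$H(E|Y) = \int_{\mathcal{Y}} H_2(P_e(y))\, dP_Y(y) = \int_{\mathcal{Y}_k \setminus \mathcal{Y}_b} H_2(P_e(y))\, dP_Y(y) \le P_Y(\mathcal{Y}_k \setminus \mathcal{Y}_b) \qquad (175)$$

since $P_e(y) = 0$ for $y \in \mathcal{Y}_b$ and $P_e(y) = 1$ for $y \in \mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)$.

For the second term we can write

$$H(X|E, Y) = \int_{\mathcal{Y}} H(X|Y = y, E = 1)\, P_e(y)\, dP_Y(y) \qquad (176)$$
$$\le \int_{\mathcal{Y}_k \setminus \mathcal{Y}_b} \log(K - 1)\, P_e(y)\, dP_Y(y) + \int_{\mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)} \log K\, dP_Y(y). \qquad (177)$$

Now we note that

$$P_e = P_Y(\mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)) + \int_{\mathcal{Y}_k \setminus \mathcal{Y}_b} P_e(y)\, dP_Y(y) \qquad (178)$$

which we can use above to get

$$H(X|E, Y) \le \left( P_e - P_Y(\mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)) \right) \log(K - 1) + P_Y(\mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)) \log K. \qquad (179)$$

Rearranging and using

$$P_b + P_Y(\mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)) + P_Y(\mathcal{Y}_k \setminus \mathcal{Y}_b) = 1 \qquad (180)$$

yields

$$H(X|Y) \le 1 - P_b + P_e \log(K - 1) + P_Y(\mathcal{Y} \setminus (\mathcal{Y}_k \cup \mathcal{Y}_b)) \left( \log \frac{K}{K - 1} - 1 \right). \qquad (181)$$

The fact $0 \le \log \frac{K}{K-1} \le 1$ completes the proof. ■

Appendix G
Proof of Proposition 14

We start by noting that by the submersion theorem [46, Cor. 5.25] for any point $y \in \mathcal{Y} = \bigcup_{i=1}^{K} \mathcal{Y}_i$ the preimage under $g|_{\mathcal{X}_i}$ is either the empty set (if $y \notin \mathcal{Y}_i$) or an $(N - M_i)$-dimensional embedded submanifold of $\mathcal{X}_i$. We now write with [26], [31]

$$d(X|Y = y) = \sum_{i=1}^{K} d(X|Y = y, X \in \mathcal{X}_i)\, P_{X|Y=y}(\mathcal{X}_i) \le \sum_{i=1}^{K} (N - M_i)\, P_{X|Y=y}(\mathcal{X}_i) \qquad (182)$$

since the information dimension of an RV cannot exceed the covering dimension of its support. Taking the expectation w.r.t. $Y$ yields

$$d(X|Y) \le \sum_{i=1}^{K} (N - M_i) \int_{\mathcal{Y}} P_{X|Y=y}(\mathcal{X}_i)\, dP_Y(y) = \sum_{i=1}^{K} (N - M_i)\, P_X(\mathcal{X}_i) \qquad (183)$$

and thus

$$l(X \to Y) \le \sum_{i=1}^{K} \frac{N - M_i}{N} P_X(\mathcal{X}_i) \qquad (184)$$

by Proposition 3. It remains to show the reverse inequality.

To this end, note that without the Lipschitz condition Proposition 4 would read

$$t(X \to Y) \le \frac{d(Y)}{d(X)}. \qquad (185)$$

But with [26], [31] we can write for the information dimension of $Y$

$$d(Y) = \sum_{i=1}^{K} d(Y|X \in \mathcal{X}_i)\, P_X(\mathcal{X}_i). \qquad (186)$$

By the fact that the $g_i$ are submersions, preimages of $\mu^{M_i}$-null sets are $\mu^{N}$-null sets [66] - were there a $\mu^{M_i}$-null set $B \subseteq \mathcal{Y}_i$ such that the conditional probability measure $P_{Y|X \in \mathcal{X}_i}(B) > 0$, there would be some $\mu^{N}$-null set $g_i^{-1}(B) \subseteq \mathcal{X}_i$ such that $P_{X|X \in \mathcal{X}_i}(g_i^{-1}(B)) > 0$, contradicting $P_X \ll \mu^{N}$. Thus, $P_{Y|X \in \mathcal{X}_i} \ll \mu^{M_i}$ and we get

$$l(X \to Y) \ge 1 - \sum_{i=1}^{K} \frac{M_i}{N} P_X(\mathcal{X}_i) = \sum_{i=1}^{K} \frac{N - M_i}{N} P_X(\mathcal{X}_i). \qquad (187)$$

This proves the reverse inequality and completes the proof. ■

Appendix H
Proof of Proposition 15

The proof follows from Proposition 4, Definition 3, and the fact that conditioning reduces entropy. We write

$$t(X \to Y) = t(X \to Y^{(1)}, Y^{(2)}, \dots, Y^{(K)}) \qquad (188)$$
$$= \frac{d(Y^{(1)}, \dots, Y^{(K)})}{d(X)} \qquad (189)$$
$$= \lim_{n\to\infty} \frac{H(\hat Y_n^{(1)}, \dots, \hat Y_n^{(K)})}{H(\hat X_n)} \qquad (190)$$
$$= \lim_{n\to\infty} \frac{\sum_{i=1}^{K} H(\hat Y_n^{(i)} | \hat Y_n^{(1)}, \dots, \hat Y_n^{(i-1)})}{H(\hat X_n)} \qquad (191)$$
$$\le \lim_{n\to\infty} \frac{\sum_{i=1}^{K} H(\hat Y_n^{(i)})}{H(\hat X_n)} = \sum_{i=1}^{K} \frac{d(Y^{(i)})}{d(X)} \qquad (192)$$
$$= \sum_{i=1}^{K} t(X \to Y^{(i)}) \qquad (193)$$

where we exchanged limit and summation by the same reason as in the proof of Proposition 3. ■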
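Appendix I
Proof of Proposition 16

The proof follows from the fact that $d(X) = N$ and, for all $i$, $d(X^{(i)}) = 1$. We obtain from the definition of relative information loss
\begin{align}
l(X \to Y) &= \lim_{n\to\infty} \frac{H(X_n^{(1)}, \dots, X_n^{(N)}|Y)}{H(X_n^{(1)}, \dots, X_n^{(N)})} \tag{194a}\\
&= \lim_{n\to\infty} \frac{\sum_{i=1}^{N} H(X_n^{(i)}|X_n^{(1)}, \dots, X_n^{(i-1)}, Y)}{H(X_n^{(1)}, \dots, X_n^{(N)})} \tag{194b}\\
&\le \lim_{n\to\infty} \frac{\sum_{i=1}^{N} H(X_n^{(i)}|Y)}{H(X_n^{(1)}, \dots, X_n^{(N)})}. \tag{194c}
\end{align}
Exchanging again limit and summation we get
\[
l(X \to Y) \le \sum_{i=1}^{N} \frac{d(X^{(i)}|Y)}{N}. \tag{195}
\]
But since $N = N\,d(X^{(i)})$ for all $i$, we can write
\[
l(X \to Y) \le \frac{1}{N}\sum_{i=1}^{N} \frac{d(X^{(i)}|Y)}{d(X^{(i)})} = \frac{1}{N}\sum_{i=1}^{N} l(X^{(i)} \to Y). \tag{196}
\]
This proves the first inequality. The second is obtained by removing conditioning again in (194b), since $Y = \{Y^{(1)}, \dots, Y^{(N)}\}$. ■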
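
As a sanity check of the two inequalities, consider the following worked example, which is not part of the original argument. It assumes $N = 2$, $X$ uniformly distributed on $[0,1)^2$, a uniform quantizer with $2^n$ bins per coordinate, and the hypothetical system $Y = (Y^{(1)}, Y^{(2)}) = (X^{(1)}, 0)$, which keeps the first coordinate and discards the second. Evaluating each relative information loss directly from its definition as a limit of entropy ratios gives

\[
\begin{aligned}
l(X \to Y) &= \lim_{n\to\infty} \frac{H(X_n^{(1)}, X_n^{(2)}|Y)}{H(X_n^{(1)}, X_n^{(2)})} = \lim_{n\to\infty} \frac{n}{2n} = \frac{1}{2},\\
l(X^{(1)} \to Y) &= \lim_{n\to\infty} \frac{0}{n} = 0, \qquad l(X^{(2)} \to Y) = \lim_{n\to\infty} \frac{n}{n} = 1,\\
\frac{1}{2}\sum_{i=1}^{2} l(X^{(i)} \to Y) &= \frac{1}{2} = \frac{1}{2}\sum_{i=1}^{2} l(X^{(i)} \to Y^{(i)}).
\end{aligned}
\]

Under these assumptions both inequalities hold with equality, because the two coordinates are independent and each output component depends on one coordinate only.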

Appendix J 
Proof of Proposition 17

Note that by the compactness of $\mathcal{X}$ the quantized input $X_n$ has a finite alphabet, which allows us to employ Fano's inequality
\[
H(X_n|Y) \le H_2(P_{e,n}) + P_{e,n}\log\operatorname{card}(\mathcal{P}_n) \tag{197}
\]
where
\[
P_{e,n} = \Pr\big(r(Y) \ne X_n\big). \tag{198}
\]



Since Fano's inequality holds for arbitrary reconstructors, we let $r(\cdot)$ be the composition of the MAP reconstructor $r_{\mathrm{MAP}}(\cdot)$ and the quantizer introduced in Section III-A. Consequently, $P_{e,n}$ is the probability that $r_{\mathrm{MAP}}(Y)$ and $X$ do not lie in the same quantization bin. Since the bin volumes reduce with increasing $n$, $P_{e,n}$ increases monotonically to $P_e$. We thus obtain with $H_2(p) \le 1$ for $0 \le p \le 1$
\[
H(X_n|Y) \le 1 + P_e\log\operatorname{card}(\mathcal{P}_n). \tag{199}
\]
With the introduced definitions,
\begin{align}
l(X \to Y) &= \lim_{n\to\infty} \frac{H(X_n|Y)}{H(X_n)} \tag{200}\\
&\le \lim_{n\to\infty} \frac{1 + P_e\log\operatorname{card}(\mathcal{P}_n)}{H(X_n)} \tag{201}\\
&\overset{(a)}{\le} \frac{P_e N}{d(X)} \tag{202}
\end{align}



where (a) is obtained by dividing both numerator and denom- 
inator by n and evaluating the limit. This completes the proof. 
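
The limit in step (a) can be illustrated numerically. The sketch below is only an illustration under stated assumptions, not the authors' computation: it takes $X$ uniform on the unit cube $[0,1)^N$ with a uniform quantizer of $2^n$ bins per coordinate, so that $\log_2\operatorname{card}(\mathcal{P}_n) = nN$ and $H(X_n) = nN$ bits, and evaluates the right-hand side of the Fano-type bound for growing $n$. The function name `fano_ratio` and the chosen error probability are hypothetical.

```python
def fano_ratio(n, dim, p_e):
    """(1 + P_e * log2 card(P_n)) / H(X_n) for X uniform on [0,1)^dim.

    Assumes a uniform quantizer with 2**n bins per coordinate.
    """
    log_card = n * dim  # log2 of the number of bins, 2**(n*dim)
    h_xn = n * dim      # uniform X: all bins are equally likely
    return (1 + p_e * log_card) / h_xn

# Hypothetical error probability; the ratio should approach p_e * N / d(X),
# which equals p_e here because d(X) = N for a uniform X on the cube.
for n in (1, 10, 100, 1000):
    print(n, fano_ratio(n, dim=2, p_e=0.25))
```

The additive constant 1 from bounding the binary entropy becomes negligible after division by $n$, which is why only the term proportional to $P_e$ survives in the limit.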



Appendix K 
Proof of Proposition 18

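Since the partition is finite, we can write with [26, Thm. 2]
\[
d(X|Y=y) = \sum_{i} d(X|Y=y, X\in\mathcal{X}_i)\,P_{X|Y=y}(\mathcal{X}_i). \tag{203}
\]
If $g$ is injective on $\mathcal{X}_i$, the intersection $g^{-1}[y] \cap \mathcal{X}_i$ is a single point; thus $d(X|Y=y, X\in\mathcal{X}_i) = 0$. Conversely, if $g$ is constant on $\mathcal{X}_i$, the preimage is $\mathcal{X}_i$ itself, so one obtains
\[
d(X|Y=y, X\in\mathcal{X}_i) = d(X|X\in\mathcal{X}_i) = \frac{P_X^{ac}(\mathcal{X}_i)}{P_X(\mathcal{X}_i)}. \tag{204}
\]
We thus write
\begin{align}
d(X|Y) &= \int_{\mathcal{Y}} \sum_{i} d(X|Y=y, X\in\mathcal{X}_i)\,P_{X|Y=y}(\mathcal{X}_i)\,dP_Y(y) \tag{205}\\
&= \int_{\mathcal{Y}} \sum_{i:\,\mathcal{X}_i\subseteq A} \frac{P_X^{ac}(\mathcal{X}_i)}{P_X(\mathcal{X}_i)}\,P_{X|Y=y}(\mathcal{X}_i)\,dP_Y(y) \tag{206}\\
&\overset{(a)}{=} \sum_{i:\,\mathcal{X}_i\subseteq A} \frac{P_X^{ac}(\mathcal{X}_i)}{P_X(\mathcal{X}_i)} \int_{\mathcal{Y}} P_{X|Y=y}(\mathcal{X}_i)\,dP_Y(y) \tag{207}\\
&\overset{(b)}{=} P_X^{ac}(A) \tag{208}
\end{align}
where in (a) we exchanged summation and integration with the help of Fubini's theorem and (b) is due to the fact that the sum runs over exactly the union of sets on which $g$ is constant, $A$. Proposition 3 completes the proof. ■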
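
The result can be made concrete with a small numerical sketch. The example is an assumption-laden illustration and not taken from the paper: $X$ is uniform on $[0,1)$, the hypothetical system is the clipper $g(x) = \min(x, 1/2)$, which is injective on $[0, 1/2)$ and constant on $A = [1/2, 1)$, and the quantizer uses $2^n$ uniform bins. The code evaluates the ratio $H(X_n|Y)/H(X_n)$ in closed form for each $n$.

```python
import math

def relative_loss_clipper(n):
    """H(X_n|Y) / H(X_n) for X uniform on [0,1), Y = min(X, 1/2), 2**n bins."""
    bins = 2 ** n
    h_xn = math.log2(bins)
    # Y < 1/2 identifies X exactly, hence also its bin: zero conditional entropy.
    # Y = 1/2 occurs with probability 1/2 and leaves X_n uniform on the
    # bins covering the upper half of the interval.
    upper_bins = bins // 2
    h_xn_given_y = 0.5 * math.log2(upper_bins)
    return h_xn_given_y / h_xn

for n in (1, 2, 10, 100):
    print(n, relative_loss_clipper(n))
```

As $n$ grows the ratio approaches $1/2$, the probability mass of the set on which the clipper is constant, which is consistent with the proposition for a purely continuous input.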